Abstract

The marginal maximum a posteriori probability (MAP) estimation problem, which cal- 
culates the mode of the marginal posterior distribution of a subset of variables with the 
remaining variables marginalized, is an important inference problem in many models, such 
as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can 
be NP-hard even on trees, and has attracted less attention in the literature compared to 
the joint MAP (maximization) and marginalization problems. We derive a general dual 
representation for marginal MAP that naturally integrates the marginalization and max- 
imization operations into a joint variational optimization problem, making it possible to 
easily extend most or all variational-based algorithms to marginal MAP. In particular, we 
derive a set of "mixed-product" message passing algorithms for marginal MAP, whose form 
is a hybrid of max-product, sum-product and a novel "argmax-product" message updates. 
We also derive a class of convergent algorithms based on proximal point methods, including 
one that transforms the marginal MAP problem into a sequence of standard marginalization 
problems. Theoretically, we provide guarantees under which our algorithms give globally 
or locally optimal solutions, and provide novel upper bounds on the optimal objectives. 
Empirically, we demonstrate that our algorithms significantly outperform the existing ap- 
proaches, including a state-of-the-art algorithm based on local search methods. 
Keywords: Graphical Models, Message Passing, Belief Propagation, Variational Methods,
Maximum a Posteriori, Marginal-MAP, Hidden Variable Models.



1. Introduction 

Graphical models such as Bayesian networks and Markov random fields provide a powerful 
framework for reasoning about conditional dependency structures over many variables, and 
have found wide application in many areas including error correcting codes, computer vision,
and computational biology (Wainwright and Jordan, 2008; Koller and Friedman, 2009).
Given a graphical model, which may be estimated from empirical data or constructed by 
domain expertise, the term inference refers generically to answering probabilistic queries 
about the model, such as computing marginal probabilities or maximum a posteriori estimates.
Although these inference tasks are NP-hard in the worst case, recent algorithmic
advances, including the development of variational methods and the family of algorithms
collectively called belief propagation, provide approximate or exact solutions for these prob- 
lems in many practical circumstances. 

In this work we will focus on three common types of inference tasks. The first involves 
maximization or max-inference tasks, sometimes called maximum a posteriori (MAP) or 
most probable explanation (MPE) tasks, which look for a mode of the joint probability. 
The second are sum-inference tasks, which include calculating the marginal probabilities or 
the normalization constant of the distribution (corresponding to the probability of evidence 
in a Bayesian network). Finally, the main focus of this work is on marginal MAP, a type of 
mixed-inference problem that seeks a partial configuration of variables that maximizes those 
variables' marginal probability, with the remaining variables summed out.^1 A marginal
MAP problem can arise, for example, as a MAP problem on models with hidden variables 
whose predictions are not of interest, or as a robust optimization variant of MAP with some 
unknown or noisily observed parameters marginalized w.r.t. a prior distribution. It can be 
also treated as a special case of the more complicated frameworks of stochastic programming 
(Birge and Louveaux, 1997) or decision networks (Howard and Matheson, 2005; Liu and 
Ihler, 2012). 

These three types of inference tasks are listed in order of increasing difficulty:
max-inference is NP-complete, while sum-inference is #P-complete, and mixed-inference is
NP^PP-complete (Park and Darwiche, 2004; De Campos, 2011). Practically speaking, max-inference
tasks have a host of efficient algorithms such as loopy max-product BP, tree-reweighted BP, 
and dual decomposition (see e.g., Koller and Friedman, 2009; Sontag et al., 2011).
Sum-inference is more difficult than max-inference: for example there are models, such as those
with binary attractive pairwise potentials, on which sum-inference is #P-complete but
max-inference is tractable (Greig et al., 1989; Jerrum and Sinclair, 1993).

Mixed-inference is much harder than either max- or sum-inference alone:
marginal MAP can be NP-hard even on tree structured graphs, as illustrated in the example 
in Fig. 1 (Koller and Friedman, 2009). The difficulty arises in part because the max and 
sum operators do not commute, causing the feasible elimination orders to have much higher 
induced width than for sum- or max-inference. Viewed another way, the marginalization 
step may destroy the dependency structure of the original graphical model, making the 
subsequent maximization step far more challenging. Probably for these reasons, there is 
much less work on marginal MAP than that on joint MAP or marginalization, despite its 
importance to many practical problems. 

Contributions. We reformulate the mixed-inference problem as a joint maximization problem
with a free energy objective that extends the well-known log-partition function duality
form, making it possible to easily extend essentially arbitrary variational algorithms to
marginal MAP. In particular, we propose a novel "mixed-product" BP algorithm that is
a hybrid of max-product, sum-product, and special "argmax-product" message updates,
as well as a convergent proximal point algorithm that works by iteratively solving pure (or
annealed) marginalization tasks. We also present junction graph BP variants of our algorithms
that work on models with higher order cliques. We also discuss mean field methods
and highlight their connection to the expectation-maximization (EM) algorithm. We give
theoretical guarantees on the global and local optimality of our algorithms for cases when
the sum variables form tree structured subgraphs. Our numerical experiments show that
our methods can provide significantly better solutions than existing algorithms, including
a similar hybrid message passing algorithm by Jiang et al. (2011) and a state-of-the-art
algorithm based on local search methods.

1. In some literature (e.g., Park and Darwiche, 2004), marginal MAP is simply referred to as MAP, and
the joint MAP problem is called MPE.

Related Work. Expectation-maximization (EM) or variational EM provide one straight- 
forward approach for marginal MAP, by viewing the sum nodes as hidden variables and 
the max nodes as parameters to be estimated; however, EM is prone to getting stuck at 
sub-optimal configurations. The classical state-of-the-art approaches include local search 
methods (e.g., Park and Darwiche, 2004), Markov chain Monte Carlo methods (e.g., Doucet
et al., 2002; Yuan et al., 2004), and variational elimination based methods (e.g., Dechter and 
Rish, 2003; Maua and de Campos, 2012). Jiang et al. (2011) recently proposed a hybrid 
message passing algorithm that has a similar form to our mixed-product BP algorithm, 
but without theoretical guarantees; we show in Section 5.3 that Jiang et al. (2011) can be 
viewed as an approximation of the marginal MAP problem that exchanges the order of sum 
and max operators. Another message-passing-style algorithm was proposed very recently 
in Altarelli et al. (2011) for general multi-stage stochastic optimization problems based on 
survey propagation, which again does not have optimality guarantees and has a relatively 
more complicated form. Finally, Ibrahimi et al. (2011) introduce a robust max-product
belief propagation for solving a related worst-case robust optimization problem, where the
hidden variables are minimized instead of marginalized. To the best of our knowledge, our 
work is the first general variational framework for marginal MAP, and provides the first 
strong optimality guarantees. 

We begin in Section 2 by introducing background on graphical models and variational 
inference. We then introduce a novel variational dual representation for marginal MAP 
in Section 3, and propose analogues of the Bethe and tree-reweighted approximations in 
Section 4. A class of "mixed-product" message passing algorithms is proposed and ana- 
lyzed in Section 5 and convergent alternatives are proposed in Section 6 based on proximal 
point methods. We then discuss the EM algorithm and its connection to our framework in 
Section 7, and extend our algorithms to junction graphs in Section 8. Finally, we present 
numerical results in Section 9 and conclude the paper in Section 10. 

2. Background 

2.1 Graphical Models 

Let $x = \{x_1, x_2, \ldots, x_n\}$ be a random vector in a discrete space
$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. For an index set
$\alpha \subseteq \{1, \ldots, n\}$, denote by $x_\alpha$ the sub-vector $\{x_i : i \in \alpha\}$,
and similarly, $\mathcal{X}_\alpha$ the cross product of $\{\mathcal{X}_i : i \in \alpha\}$.
A graphical model defines a factorized probability on $x$,

$$p(x) = \frac{1}{Z} \prod_{\alpha \in \mathcal{I}} \psi_\alpha(x_\alpha) \quad \text{or} \quad p(x; \theta) = \exp\Big[\sum_{\alpha \in \mathcal{I}} \theta_\alpha(x_\alpha) - \Phi(\theta)\Big], \tag{1}$$

where $\mathcal{I}$ is a set of subsets of variable indexes, $\psi_\alpha : \mathcal{X}_\alpha \to \mathbb{R}^+$
is called a factor function, and $\theta_\alpha(x_\alpha) = \log \psi_\alpha(x_\alpha)$. Since the $x_i$ are
discrete, the functions $\psi$ and $\theta$ are tables; by alternatively viewing $\theta$ as a vector,
it is interpreted as the natural parameter in an overcomplete, exponential family representation.
Let $\psi$ and $\theta$ be the joint vectors of all $\psi_\alpha$ and $\theta_\alpha$ respectively, e.g.,
$\theta = \{\theta_\alpha(x_\alpha) : \alpha \in \mathcal{I}, x_\alpha \in \mathcal{X}_\alpha\}$. The normalization
constant $Z$, called the partition function, normalizes the probability to sum to one, and
$\Phi(\theta) := \log Z$ is called the log-partition function,

$$\Phi(\theta) = \log \sum_{x \in \mathcal{X}} \exp[\theta(x)],$$

where we define $\theta(x) = \sum_{\alpha \in \mathcal{I}} \theta_\alpha(x_\alpha)$ to be the joint potential
function that maps from $\mathcal{X}$ to $\mathbb{R}$. The factorization structure of $p(x)$ can be
represented by an undirected graph $G = (V, E)$, where each node $i \in V$ maps to a variable $x_i$,
and each edge $(ij) \in E$ corresponds to two variables $x_i$ and $x_j$ that coappear in some factor
function $\psi_\alpha$, that is, $\{i, j\} \subseteq \alpha$. The set $\mathcal{I}$ is then a set of cliques
(fully connected subgraphs) of $G$. For the purpose of illustration, we mainly restrict our scope
to the set of pairwise models, on which $\mathcal{I}$ is the set of nodes and edges, i.e.,
$\mathcal{I} = E \cup V$. However, we show how to extend our algorithms
to models with higher order cliques in Section 8.
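As a concrete illustration of the definitions above, the following minimal sketch evaluates the joint potential and the log-partition function of a small pairwise model by explicit enumeration. The names `theta_node`, `theta_edge`, `joint_potential`, `all_states` and `log_partition`, as well as the numerical values, are hypothetical choices made for this example; enumeration is only feasible for very small models.

```python
import itertools
import math

def joint_potential(x, theta_node, theta_edge):
    """theta(x) = sum_i theta_i(x_i) + sum_(ij) theta_ij(x_i, x_j)."""
    total = sum(theta_node[i][xi] for i, xi in enumerate(x))
    total += sum(tab[x[i]][x[j]] for (i, j), tab in theta_edge.items())
    return total

def all_states(theta_node):
    """Enumerate every joint configuration x in X_1 x ... x X_n."""
    return itertools.product(*(range(len(t)) for t in theta_node))

def log_partition(theta_node, theta_edge):
    """Phi(theta) = log sum_x exp(theta(x)), via a stable log-sum-exp."""
    vals = [joint_potential(x, theta_node, theta_edge)
            for x in all_states(theta_node)]
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# Hypothetical binary chain 0 - 1 - 2 (illustrative numbers only).
theta_node = [[0.0, 0.5], [0.0, -0.2], [0.3, 0.0]]
theta_edge = {(0, 1): [[1.0, 0.0], [0.0, 1.0]],
              (1, 2): [[0.5, 0.0], [0.0, 0.5]]}

phi = log_partition(theta_node, theta_edge)
# p(x; theta) = exp(theta(x) - Phi(theta)) sums to one by construction.
p = {x: math.exp(joint_potential(x, theta_node, theta_edge) - phi)
     for x in all_states(theta_node)}
print(phi, sum(p.values()))
```

The cost of this enumeration grows exponentially with the number of variables, which is the reason the approximations discussed in the following sections are needed.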

2.2 Sum-Inference Problems and Variational Approximation 

Sum-inference is the task of marginalizing (summing out) variables in the model, e.g., 
calculating the marginal probabilities of single variables, or the normalization constant Z, 

$$p(x_i) = \frac{1}{Z} \sum_{x_{V \setminus \{i\}}} \exp[\theta(x)], \qquad Z = \sum_{x} \exp[\theta(x)]. \tag{2}$$

Unfortunately, the problem is generally #P-complete, and the straightforward calculation
requires summing over an exponential number of terms. Variational methods are a class 
of approximation algorithms that transform the marginalization problem into a continuous 
optimization problem, which is then typically solved approximately. 

Marginal Polytope. The marginal polytope is a key concept in variational inference.
We define the marginal polytope $\mathbb{M}$ to be the set of local marginal probabilities
$\tau = \{\tau_\alpha(x_\alpha) : \alpha \in \mathcal{I}\}$ that are extensible to a valid joint distribution, i.e.,

$$\mathbb{M} = \Big\{\tau : \exists \text{ joint distribution } q(x), \text{ s.t. } \tau_\alpha(x_\alpha) = \sum_{x_{[n] \setminus \alpha}} q(x) \text{ for } \forall \alpha \in \mathcal{I}\Big\}. \tag{3}$$

Denote by $Q[\tau]$ the set of joint distributions whose marginals are consistent with $\tau \in \mathbb{M}$;
by the principle of maximum entropy (Jaynes, 1957), there exists a unique distribution in
$Q[\tau]$ that has maximum entropy and follows the exponential family form for some $\theta$.^2 With
an abuse of notation, we denote these unique global distributions by $\tau(x)$, and we do not
distinguish $\tau(x)$ and $\tau$ when it is clear from the context.

Log-partition Function Duality. A key result underlying many variational methods is that
the log-partition function $\Phi(\theta)$ is a convex function of $\theta$ and can be rewritten into a convex
dual form,

$$\Phi(\theta) = \max_{\tau \in \mathbb{M}} \{\langle \theta, \tau \rangle + H(\tau)\}, \tag{4}$$

where $\langle \theta, \tau \rangle = \sum_{\alpha} \sum_{x_\alpha} \theta_\alpha(x_\alpha) \tau_\alpha(x_\alpha)$ is the vectorized inner
product, and $H(\tau)$ is the entropy of the corresponding global distribution $\tau(x)$, i.e.,
$H(\tau) = -\sum_{x} \tau(x) \log \tau(x)$.
The unique maximum $\tau^*$ of (4) exactly equals the marginals of the original distribution
$p(x; \theta)$, that is, $\tau^*(x) = p(x; \theta)$. We call $F_{sum}(\tau, \theta) = \langle \theta, \tau \rangle + H(\tau)$ the sum-inference free
energy (although technically the negative free energy).

2. In the case that $p(x)$ has zero elements, the maximum entropy distribution is still unique and satisfies
the exponential family form, but the corresponding $\theta$ has negative infinite values (Jaynes, 1957).
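The duality in (4) can be checked numerically on a tiny model. The sketch below is only illustrative: it represents $\tau$ by its full joint table (so that $\langle \theta, \tau \rangle$ is simply the expectation of $\theta(x)$ under $\tau$), and the potential values and the helper names `entropy` and `f_sum` are assumptions made for this example. The gap $\Phi(\theta) - F_{sum}(\tau, \theta)$ equals the KL divergence between $\tau$ and $p$, so it vanishes at $\tau = p$ and is positive elsewhere.

```python
import math

def entropy(tau):
    """H(tau) = -sum_x tau(x) log tau(x)."""
    return -sum(t * math.log(t) for t in tau.values() if t > 0)

def f_sum(theta, tau):
    """F_sum(tau, theta) = <theta, tau> + H(tau), with <theta, tau> = E_tau[theta(x)]."""
    return sum(tau[x] * theta[x] for x in tau) + entropy(tau)

# Hypothetical joint potential theta(x) over two binary variables.
theta = {(0, 0): 1.0, (0, 1): -0.5, (1, 0): 0.2, (1, 1): 0.7}
phi = math.log(sum(math.exp(v) for v in theta.values()))

p = {x: math.exp(v - phi) for x, v in theta.items()}   # maximizer tau* = p
uniform = {x: 0.25 for x in theta}                     # some other feasible tau

gap_at_p = phi - f_sum(theta, p)              # zero up to rounding error
gap_at_uniform = phi - f_sum(theta, uniform)  # KL(uniform || p), positive
print(gap_at_p, gap_at_uniform)
```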

The dual form (4) transforms the marginalization problem into a continuous optimiza- 
tion, but does not make it any easier: the marginal polytope M is defined by an exponential 
number of linear constraints, and the entropy term in the objective function is as difficult 
to calculate as the log-partition function. However, (4) provides a framework for deriving 
efficient approximate inference algorithms by approximating both the marginal polytope 
and the entropy (Wainwright and Jordan, 2008). 

BP-like Methods. Many approximation methods replace $\mathbb{M}$ with the locally consistent
polytope $\mathbb{L}(G)$; in pairwise models, it is the set of singleton and pairwise "pseudo-marginals"
$\{\tau_i(x_i) \mid i \in V\}$ and $\{\tau_{ij}(x_i, x_j) \mid (ij) \in E\}$ that are consistent on their intersections, that is,

$$\mathbb{L}(G) = \Big\{\tau_i, \tau_{ij} : \sum_{x_i} \tau_{ij}(x_i, x_j) = \tau_j(x_j), \ \sum_{x_i} \tau_i(x_i) = 1, \ \tau_{ij}(x_i, x_j) \ge 0\Big\}.$$

Since not all such pseudo-marginals have valid global distributions, it is easy to see that
$\mathbb{L}(G)$ is an outer bound of $\mathbb{M}$, that is, $\mathbb{M} \subseteq \mathbb{L}(G)$.

The free energy remains intractable (and is not even well-defined) in $\mathbb{L}(G)$. We typically
approximate the free energy by a combination of singleton and pairwise entropies, which
only requires knowing $\tau_i$ and $\tau_{ij}$. For example, the Bethe free energy approximation (Yedidia
et al., 2003) is

$$H(\tau) \approx \sum_{i \in V} H_i(\tau) - \sum_{(ij) \in E} I_{ij}(\tau), \qquad \Phi_{bethe}(\theta) = \max_{\tau \in \mathbb{L}(G)} \Big\{\langle \theta, \tau \rangle + \sum_{i \in V} H_i(\tau) - \sum_{(ij) \in E} I_{ij}(\tau)\Big\}, \tag{5}$$

where $H_i(\tau)$ is the entropy of $\tau_i(x_i)$ and $I_{ij}(\tau)$ the mutual information of $x_i$ and $x_j$, i.e.,

$$H_i(\tau) = -\sum_{x_i} \tau_i(x_i) \log \tau_i(x_i), \qquad I_{ij}(\tau) = \sum_{x_i, x_j} \tau_{ij}(x_i, x_j) \log \frac{\tau_{ij}(x_i, x_j)}{\tau_i(x_i) \tau_j(x_j)}.$$

We sometimes abbreviate $H_i(\tau)$ and $I_{ij}(\tau)$ into $H_i$ and $I_{ij}$ for convenience. The well-known
loopy belief propagation (BP) algorithm of Pearl (1988) can be interpreted as a fixed point
algorithm to optimize the Bethe free energy in (5) on the locally consistent polytope $\mathbb{L}(G)$
(Yedidia et al., 2003). Unfortunately, the Bethe free energy is a non-concave function of $\tau$,
causing (5) to be a non-convex optimization. The tree reweighted (TRW) free energy is a
convex surrogate of the Bethe free energy (Wainwright et al., 2005a),

$$H(\tau) \le \sum_{i \in V} H_i(\tau) - \sum_{(ij) \in E} \rho_{ij} I_{ij}(\tau), \qquad \Phi_{trw}(\theta) = \max_{\tau \in \mathbb{L}(G)} \Big\{\langle \theta, \tau \rangle + \sum_{i \in V} H_i(\tau) - \sum_{(ij) \in E} \rho_{ij} I_{ij}(\tau)\Big\}, \tag{6}$$

where $\{\rho_{ij} : (ij) \in E\}$ is a set of positive edge appearance probabilities obtained from a
weighted collection of spanning trees of $G$ (see Wainwright et al. (2005a) and Section 4.2 for
the detailed definition). The TRW approximation in (6) is a convex optimization problem,
and is guaranteed to give an upper bound of the true log-partition function. A message
passing algorithm similar to loopy BP, called tree reweighted BP, can be derived as a fixed
point algorithm for solving the convex optimization in (6).

Mean-field-based Methods. Mean-field-based methods are another set of approximate
inference algorithms, which work by restricting $\mathbb{M}$ to a set of tractable distributions,
on which both the marginal polytope and the joint entropy are tractable. Precisely, let
$\mathbb{M}_{mf}$ be a subset of $\mathbb{M}$ that corresponds to a set of tractable distributions, e.g., the set of
fully factored distributions, $\mathbb{M}_{mf} = \{\tau \in \mathbb{M} : \tau(x) = \prod_{i \in V} \tau_i(x_i)\}$. The mean field methods
approximate the log-partition function (4) by

$$\max_{\tau \in \mathbb{M}_{mf}} \{\langle \theta, \tau \rangle + H(\tau)\}, \tag{7}$$

which is guaranteed to give a lower bound of the log-partition function. Unfortunately,
mean field methods usually lead to non-convex optimization problems, because $\mathbb{M}_{mf}$ is
often a non-convex set. In practice, block coordinate descent methods can be adopted to find
the local optima of (7).
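A minimal sketch of the fully factored case of (7) is given below, assuming a small pairwise model stored in the hypothetical containers `theta_node` and `theta_edge`. Each coordinate update sets $\tau_i(x_i) \propto \exp(\theta_i(x_i) + \sum_j \sum_{x_j} \tau_j(x_j) \theta_{ij}(x_i, x_j))$, which is the exact maximizer of (7) over $\tau_i$ with the other factors fixed; the resulting value is compared against the log-partition function computed by enumeration. This is an illustration of the lower-bound property, not a tuned implementation.

```python
import itertools
import math

def normalize(w):
    s = sum(w)
    return [v / s for v in w]

def mean_field(theta_node, theta_edge, sweeps=50):
    """Coordinate updates over fully factored tau(x) = prod_i tau_i(x_i)."""
    tau = [normalize([1.0] * len(t)) for t in theta_node]
    for _ in range(sweeps):
        for i in range(len(theta_node)):
            logits = list(theta_node[i])
            for (a, b), tab in theta_edge.items():
                for xi in range(len(logits)):
                    if a == i:   # expectation of theta_ib over tau_b
                        logits[xi] += sum(t * tab[xi][xb] for xb, t in enumerate(tau[b]))
                    elif b == i:  # expectation of theta_ai over tau_a
                        logits[xi] += sum(t * tab[xa][xi] for xa, t in enumerate(tau[a]))
            m = max(logits)
            tau[i] = normalize([math.exp(v - m) for v in logits])
    return tau

def mean_field_bound(theta_node, theta_edge, tau):
    """<theta, tau> + H(tau) for a fully factored tau."""
    val = 0.0
    for i, t in enumerate(tau):
        val += sum(ti * theta_node[i][xi] for xi, ti in enumerate(t))
        val -= sum(ti * math.log(ti) for ti in t if ti > 0)
    for (a, b), tab in theta_edge.items():
        val += sum(ta * tb * tab[xa][xb]
                   for xa, ta in enumerate(tau[a]) for xb, tb in enumerate(tau[b]))
    return val

def log_partition(theta_node, theta_edge):
    """Exact Phi(theta) by enumeration, for comparison only."""
    vals = []
    for x in itertools.product(*(range(len(t)) for t in theta_node)):
        v = sum(theta_node[i][xi] for i, xi in enumerate(x))
        v += sum(tab[x[a]][x[b]] for (a, b), tab in theta_edge.items())
        vals.append(v)
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

# Hypothetical binary chain 0 - 1 - 2 (illustrative numbers only).
theta_node = [[0.0, 0.5], [0.0, -0.2], [0.3, 0.0]]
theta_edge = {(0, 1): [[1.0, 0.0], [0.0, 1.0]],
              (1, 2): [[0.5, 0.0], [0.0, 0.5]]}

tau = mean_field(theta_node, theta_edge)
lower = mean_field_bound(theta_node, theta_edge, tau)
phi = log_partition(theta_node, theta_edge)
print(lower, phi)  # lower never exceeds phi
```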

2.3 Max-Inference Problems 

Combinatorial maximization (max-inference), or maximum a posteriori (MAP), problems
are the tasks of finding a mode of the joint probability. That is,

$$\Phi_\infty(\theta) = \max_{x} \sum_{\alpha \in \mathcal{I}} \theta_\alpha(x_\alpha), \qquad x^* = \arg\max_{x} \sum_{\alpha \in \mathcal{I}} \theta_\alpha(x_\alpha), \tag{8}$$

where $x^*$ is a MAP configuration and $\Phi_\infty(\theta)$ the optimal energy value. This problem can
be reformulated as a linear program,

$$\Phi_\infty(\theta) = \max_{\tau \in \mathbb{M}} \langle \theta, \tau \rangle, \tag{9}$$

which attains its maximum when $\tau^*(x) = \mathbf{1}(x = x^*)$, where $\mathbf{1}(\cdot)$ is the Kronecker delta
function, defined as $\mathbf{1}(t) = 1$ if condition $t$ is true, and zero otherwise. If there are multiple
MAP solutions, say $\{x^{*k} : k = 1, \ldots, K\}$, then any convex combination $\sum_k c_k \mathbf{1}(x = x^{*k})$
with $\sum_k c_k = 1$, $c_k \ge 0$ leads to a maximum of (9).

The problem in (9) remains NP-hard, because of the intractability of the marginal polytope
$\mathbb{M}$. Most variational methods for MAP (e.g., Wainwright et al., 2005b; Werner, 2007) can be
interpreted as relaxing $\mathbb{M}$ to the locally consistent polytope $\mathbb{L}(G)$, yielding a linear relaxation
of the original integer programming problem. Note that (9) differs from (4) only by its lack
of an entropy term; in the next section, we generalize this similarity to marginal MAP.

2.4 Marginal MAP Problems 

Marginal MAP is simply a hybrid of the max- and sum-inference tasks. Let $A$ be a subset
of nodes $V$, and $B = V \setminus A$ be the complement of $A$. The marginal MAP problem seeks a
partial configuration $x_B^*$ that has the maximum marginal probability $p(x_B) = \sum_{x_A} p(x)$,
where $A$ is the set of sum nodes to be marginalized out, and $B$ the max nodes to be
optimized. We call this a type of "mixed-inference" problem, since it involves more than
one type of variable elimination operator. In terms of the exponential family representation,
marginal MAP can be formulated as

$$\Phi_{AB}(\theta) = \max_{x_B} Q(x_B; \theta), \quad \text{where} \quad Q(x_B; \theta) = \log \sum_{x_A} \exp\Big(\sum_{\alpha \in \mathcal{I}} \theta_\alpha(x_\alpha)\Big). \tag{10}$$

[Figure 1 panel. Marginal MAP: $x_B^* = \arg\max_{x_B} p(x_B) = \arg\max_{x_B} \sum_{x_A} p(x)$.]

Figure 1: An example from Koller and Friedman (2009) in which a marginal MAP query
on a tree requires exponential time complexity. The marginalization over $x_A$
destroys the conditional dependency structure in the marginal distribution $p(x_B)$,
causing an intractable maximization problem over $x_B$. The complexity of the
exact variable elimination method is $O(\exp(n))$, where $n$ is the length of the
chain.

Although similar to max- and sum-inference, marginal MAP is significantly harder than
either of them. A classic example is shown in Fig. 1, where marginal MAP is NP-hard
even on a tree structured graph (Koller and Friedman, 2009). The main difficulty arises
because the max and sum operators do not commute, which restricts feasible elimination
orders to those with all the sum nodes eliminated before any max nodes. In the worst case,
marginalizing the sum nodes $x_A$ may destroy any conditional independence among the max
nodes $x_B$, making it difficult to represent or optimize $Q(x_B; \theta)$, even when the sum part
alone is tractable (such as when the nodes in A form a tree).
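The fact that max and sum do not commute can be seen on a two-by-two table. In the sketch below, `f[x_B][x_A]` stands for the unnormalized weight $\exp(\theta(x))$ of a hypothetical model with one binary max node and one binary sum node; the numbers are made up for illustration. Eliminating in the marginal MAP order (sum first, then max) gives a different value than the swapped order, and for nonnegative weights the swapped order can only overestimate.

```python
# Hypothetical unnormalized table f[x_B][x_A] = exp(theta(x)), binary x_A and x_B.
f = [[0.9, 0.0],
     [0.5, 0.5]]

# Marginal MAP order: sum over x_A first, then max over x_B.
max_sum = max(sum(row) for row in f)

# Swapped order: max over x_B first, then sum over x_A.
sum_max = sum(max(f[b][a] for b in range(2)) for a in range(2))

print(max_sum, sum_max)  # the swapped order is an upper bound on max_sum
```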

Despite its computational difficulty, marginal MAP plays an essential role in many
practical scenarios. The marginal MAP configuration $x_B^*$ in (10) is Bayes optimal in the
sense that it minimizes the expected error on $B$, $E[\mathbf{1}(x_B^* \ne x_B)]$, where $E[\cdot]$ denotes the
expectation under distribution $p(x; \theta)$. Here, the variables $x_A$ are not included in the error
criterion, for example because they are "nuisance" hidden variables of no direct interest,
or unobserved or inaccurately measured model parameters. In contrast, the joint MAP
configuration $x^*$ minimizes the joint error $E[\mathbf{1}(x^* \ne x)]$, but this gives no guarantees on
the partial error $E[\mathbf{1}(x_B^* \ne x_B)]$. In practice, perhaps because of the wide availability of
efficient algorithms for joint MAP, researchers tend to over-use joint MAP even in cases
where marginal MAP would be more appropriate. The following toy example shows that
this seemingly reasonable approach can sometimes cause serious problems.

Example 1 (Weather Dilemma). Denote by $x_b \in \{rainy, sunny\}$ the weather condition of
Irvine, and $x_a \in \{walk, drive\}$ whether Alice drives or walks to the school depending on
the weather condition. Assume the probabilities of $x_b$ and $x_a$ are

    p(x_b):        rainy    sunny
                   0.4      0.6

    p(x_a|x_b):             walk     drive
                   rainy    1/8      7/8
                   sunny    1/2      1/2

The task is to calculate the most likely weather condition of Irvine, which is obviously sunny
according to $p(x_b)$. The marginal MAP, $x_b^* = \arg\max_{x_b} p(x_b) = sunny$, gives the correct
answer. However, the full MAP estimator, $[x_a^*, x_b^*] = \arg\max p(x_a, x_b) = [drive, rainy]$,
gives answer $x_b^* = rainy$ (by dropping the $x_a^*$ component), which is obviously wrong.
Paradoxically, if $p(x_a|x_b)$ is changed (say, corresponding to a different person), the solution
returned by full MAP could be different.

In the above example, since no evidence on $x_a$ is observed, the conditional probability
$p(x_a|x_b)$ does not provide useful information for $x_b$, but instead provides misleading
information when it is incorporated in the full MAP estimator. The marginal MAP, on the
other hand, eliminates the influence of the irrelevant $p(x_a|x_b)$ by marginalizing (or averaging) $x_a$.
In general, the marginal MAP and full MAP can differ significantly when the uncertainty
in the hidden variables changes as a function of $x_B$.
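The arithmetic of Example 1 can be reproduced directly from the two probability tables. The short sketch below uses exact fractions; the variable names are chosen for this illustration only.

```python
from fractions import Fraction as Fr

# Tables from Example 1.
p_b = {"rainy": Fr(2, 5), "sunny": Fr(3, 5)}
p_a_given_b = {"rainy": {"walk": Fr(1, 8), "drive": Fr(7, 8)},
               "sunny": {"walk": Fr(1, 2), "drive": Fr(1, 2)}}

# Joint p(x_a, x_b) = p(x_b) p(x_a | x_b).
joint = {(a, b): p_b[b] * p_a_given_b[b][a]
         for b in p_b for a in p_a_given_b[b]}

# Marginal MAP: sum out x_a, then maximize over x_b.
marginal_b = {b: sum(joint[(a, b)] for a in p_a_given_b[b]) for b in p_b}
marginal_map = max(marginal_b, key=marginal_b.get)

# Full (joint) MAP: maximize over (x_a, x_b) jointly.
joint_map = max(joint, key=joint.get)

print(marginal_map)  # sunny
print(joint_map)     # ('drive', 'rainy')
```

The joint probabilities are 1/20, 7/20, 3/10 and 3/10, so the single most likely joint configuration is (drive, rainy) even though sunny has marginal probability 3/5.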



3. A Dual Representation for Marginal MAP 

In this section, we present our main result, a dual representation of the marginal MAP 
problem (10). Our dual representation generalizes that of sum-inference in (4) and max- 
inference in (9), and provides a unified framework for solving marginal MAP problems. 

Theorem 2. The marginal MAP energy $\Phi_{AB}(\theta)$ in (10) has a dual representation,

$$\Phi_{AB}(\theta) = \max_{\tau \in \mathbb{M}} \{\langle \theta, \tau \rangle + H_{A|B}(\tau)\}, \tag{11}$$

where $H_{A|B}(\tau)$ is a conditional entropy, $H_{A|B}(\tau) = -\sum_{x} \tau(x) \log \tau(x_A | x_B)$. If $Q(x_B; \theta)$
has a unique maximum $x_B^*$, the maximum point $\tau^*$ of (11) is also unique, satisfying
$\tau^*(x) = \tau^*(x_B) \tau^*(x_A | x_B)$, where $\tau^*(x_B) = \mathbf{1}(x_B = x_B^*)$ and $\tau^*(x_A | x_B) = p(x_A | x_B; \theta)$.^3

Proof. For any $\tau \in \mathbb{M}$ and its corresponding global distribution $\tau(x)$, consider the
conditional KL divergence between $\tau(x_A | x_B)$ and $p(x_A | x_B; \theta)$,

$$D_{KL}[\tau(x_A | x_B) \,\|\, p(x_A | x_B; \theta)] = \sum_{x} \tau(x) \log \frac{\tau(x_A | x_B)}{p(x_A | x_B; \theta)}$$
$$= -H_{A|B}(\tau) - E_\tau[\log p(x_A | x_B; \theta)]$$
$$= -H_{A|B}(\tau) - E_\tau[\theta(x)] + E_\tau[Q(x_B; \theta)] \ge 0,$$

where $H_{A|B}(\tau)$ is the conditional entropy on $\tau(x)$; the equality on the last line holds because
$p(x_A | x_B; \theta) = \exp(\theta(x) - Q(x_B; \theta))$; the last inequality follows from the nonnegativity of
KL divergence, and is tight if and only if $\tau(x_A | x_B) = p(x_A | x_B; \theta)$ for all $x_A$ and $x_B$ such that
$\tau(x_B) \ne 0$. Therefore, we have for any $\tau(x)$,

$$\Phi_{AB}(\theta) = \max_{x_B} Q(x_B; \theta) \ge E_\tau[Q(x_B; \theta)] \ge E_\tau[\theta(x)] + H_{A|B}(\tau).$$

It is easy to show that the two inequality signs are tight if and only if $\tau(x)$ equals $\tau^*(x)$ as
defined above. Substituting $E_\tau[\theta(x)] = \langle \theta, \tau \rangle$ completes the proof. □

3. Since $\tau^*(x_B) = 0$ if $x_B \ne x_B^*$, we do not necessarily need to define $\tau^*(x_A | x_B)$ for $x_B \ne x_B^*$.

    Problem Type     Primal Form                                        Dual Form
    Max-Inference    $\log \max_{x} \exp(\theta(x))$                    $\max_{\tau \in \mathbb{M}} \{\langle \theta, \tau \rangle\}$
    Sum-Inference    $\log \sum_{x} \exp(\theta(x))$                    $\max_{\tau \in \mathbb{M}} \{\langle \theta, \tau \rangle + H(\tau)\}$
    Marginal MAP     $\log \max_{x_B} \sum_{x_A} \exp(\theta(x))$       $\max_{\tau \in \mathbb{M}} \{\langle \theta, \tau \rangle + H_{A|B}(\tau)\}$

Table 1: The primal and dual forms of the three inference types. The dual forms of
sum-inference and max-inference are well known; the form for marginal MAP is a
contribution of this work. Intuitively, the max vs. sum operators in the primal
form determine the conditioning set of the conditional entropy term in the dual
form.

Remark 1. If $Q(x_B; \theta)$ has multiple maxima $\{x_B^{*k}\}$, each corresponding to a
distribution $\tau^{*k}(x) = \mathbf{1}(x_B = x_B^{*k}) p(x_A | x_B; \theta)$, then the set of maximum points of (11) is the
convex hull of $\{\tau^{*k}\}$.

Remark 2. Theorem 2 naturally integrates the marginalization and maximization
sub-problems into one joint optimization problem, providing a novel and efficient treatment
for marginal MAP beyond the traditional approaches that treat the marginalization
sub-problem as a sub-routine of the maximization problem. As we show in Section 5, this enables
us to derive efficient "mixed-product" message passing algorithms that simultaneously take
marginalization and maximization steps, avoiding expensive and possibly wasteful inner loop
steps in the marginalization sub-routine.

Remark 3. Since we have $H_{A|B}(\tau) = H(\tau) - H_B(\tau)$ by the entropic chain rule (Cover
and Thomas, 2006), the objective function in (11) can be viewed as a "truncated" free energy,

$$F_{mix}(\tau, \theta) := \langle \theta, \tau \rangle + H_{A|B}(\tau) = F_{sum}(\tau, \theta) - H_B(\tau),$$

where the entropy $H_B(\tau)$ of the max nodes $x_B$ is removed from the regular sum-inference
free energy $F_{sum}(\tau, \theta) = \langle \theta, \tau \rangle + H(\tau)$. Theorem 2 generalizes the dual form of both
sum-inference (4) and max-inference (9), since it reduces to those forms when the max set $B$ is
empty or all nodes, respectively. Table 1 shows all three forms together for comparison.
Intuitively, since the entropy $H_B(\tau)$ is removed from the objective, the optimal marginal
$\tau^*(x_B)$ tends to have lower entropy and its probability mass concentrates on the optimal
configurations $\{x_B^*\}$. Alternatively, the $\tau^*(x)$ can be interpreted as the marginals obtained
by clamping the value of $x_B$ at $x_B^*$ on the distribution $p(x; \theta)$, i.e., $\tau^*(x) = p(x | x_B = x_B^*; \theta)$.
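Theorem 2 and the truncated free energy of Remark 3 can be checked numerically on a toy model with one sum variable and one max variable. The sketch below is an illustration under assumed values: `theta`, `A_VALS`, `B_VALS` and the helper names are hypothetical, and $\tau$ is again stored as a full joint table. It evaluates $F_{mix}$ at the maximizer $\tau^*$ described in the theorem and at the (feasible but suboptimal) model distribution $p$.

```python
import math

# Hypothetical joint potential theta[(x_A, x_B)]; x_A in {0,1,2}, x_B in {0,1}.
theta = {(0, 0): 0.2, (1, 0): 1.5, (2, 0): -0.3,
         (0, 1): 0.9, (1, 1): 0.8, (2, 1): 0.7}
A_VALS, B_VALS = (0, 1, 2), (0, 1)

def Q(xb):
    """Q(x_B; theta) = log sum_{x_A} exp(theta(x))."""
    return math.log(sum(math.exp(theta[(xa, xb)]) for xa in A_VALS))

def f_mix(tau):
    """<theta, tau> + H_{A|B}(tau), with H_{A|B} = -sum_x tau(x) log tau(x_A | x_B)."""
    tau_b = {xb: sum(tau[(xa, xb)] for xa in A_VALS) for xb in B_VALS}
    val = 0.0
    for (xa, xb), t in tau.items():
        if t > 0:
            val += t * (theta[(xa, xb)] - math.log(t / tau_b[xb]))
    return val

phi_ab = max(Q(xb) for xb in B_VALS)   # primal value of (10)
xb_star = max(B_VALS, key=Q)

# tau*(x) = 1(x_B = x_B*) p(x_A | x_B; theta), as in Theorem 2.
tau_star = {(xa, xb): (math.exp(theta[(xa, xb)] - Q(xb)) if xb == xb_star else 0.0)
            for xa in A_VALS for xb in B_VALS}

# The model distribution p(x; theta) maximizes F_sum but not F_mix.
phi = math.log(sum(math.exp(v) for v in theta.values()))
p = {x: math.exp(v - phi) for x, v in theta.items()}

print(phi_ab, f_mix(tau_star), f_mix(p))
```

At $\tau = p$ the value of $F_{mix}$ equals $\Phi(\theta) - H_B(p)$, in line with the identity $F_{mix} = F_{sum} - H_B$.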

Remark 4. Unfortunately, subtracting the $H_B(\tau)$ term causes some subtle difficulties.
First, $H_B(\tau)$ (and hence $F_{mix}(\tau, \theta)$) may be intractable to calculate even when the joint
entropy $H(\tau)$ is tractable, because the marginal distribution $p(x_B) = \sum_{x_A} p(x)$ does not
necessarily inherit the conditional dependency structure of the joint distribution. Therefore,
the dual optimization in (11) may be intractable even on a tree, reflecting the intrinsic
difficulty of marginal MAP compared to full MAP or marginalization. Interestingly, we
show in the sequel that a certificate of optimality can still be obtained on general tree
graphs in some cases.

Secondly, the conditional entropy $H_{A|B}(\tau)$ (and hence $F_{mix}(\tau, \theta)$) is concave, but not
strongly concave, with respect to $\tau$. This creates additional difficulty when optimizing
(11), since many iterative optimization algorithms, such as coordinate descent, can lose their
typical convergence or optimality guarantees when the objective function is not strongly
convex.

Smoothed Approximation. To sidestep the issue of non-strong convexity, we introduce
a smoothed approximation of $F_{mix}(\tau, \theta)$ that "adds back" part of the missing $H_B(\tau)$
term,

$$F_{mix}^{\epsilon}(\tau, \theta) = \langle \theta, \tau \rangle + H_{A|B}(\tau) + \epsilon H_B(\tau),$$

where $\epsilon$ is a small positive constant. This smoothed dual approximation is closely connected
to a direct approximation in the primal domain, as shown in the following theorem.

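Theorem 3. Let $\epsilon$ be a positive constant, and $Q(x_B; \theta)$ as defined in (10). Define

$$\Phi_{AB}^{\epsilon}(\theta) = \log \Big\{ \Big[ \sum_{x_B} \exp(Q(x_B; \theta))^{1/\epsilon} \Big]^{\epsilon} \Big\},$$

then we have

$$\Phi_{AB}^{\epsilon}(\theta) = \max_{\tau \in \mathbb{M}} \{\langle \theta, \tau \rangle + H_{A|B}(\tau) + \epsilon H_B(\tau)\}. \tag{12}$$

In addition, we have

$$\lim_{\epsilon \to 0^+} \Phi_{AB}^{\epsilon}(\theta) = \Phi_{AB}(\theta),$$

where $\epsilon \to 0^+$ denotes approaching zero from the positive side.

Proof. The proof is similar to that of Theorem 2, but exploits the non-negativity of a
weighted sum of two KL divergence terms,

$$D_{KL}[\tau(x_A | x_B) \,\|\, p(x_A | x_B; \theta)] + \epsilon\, D_{KL}[\tau(x_B) \,\|\, p(x_B)].$$

The remaining part follows directly from the standard zero temperature limit formula,

$$\lim_{\epsilon \to 0^+} \Big[ \sum_{x} f(x)^{1/\epsilon} \Big]^{\epsilon} = \max_{x} f(x), \tag{13}$$

where $f(x)$ is any function with positive values. □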
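The primal quantity in Theorem 3 is a temperature-scaled log-sum-exp of $Q(x_B; \theta)$, since $\log [\sum_{x_B} \exp(Q)^{1/\epsilon}]^{\epsilon} = \epsilon \log \sum_{x_B} \exp(Q / \epsilon)$. The sketch below evaluates it for a few values of $\epsilon$ on made-up values of $Q$; the function name `smoothed_max` and the numbers are assumptions for illustration. The smoothed value always lies between $\max_{x_B} Q$ and $\max_{x_B} Q + \epsilon \log |\mathcal{X}_B|$, which makes the limit in the theorem visible.

```python
import math

def smoothed_max(q_vals, eps):
    """eps * log sum_{x_B} exp(Q(x_B) / eps), computed stably."""
    m = max(q_vals)
    return m + eps * math.log(sum(math.exp((q - m) / eps) for q in q_vals))

# Hypothetical values of Q(x_B; theta) over four configurations of x_B.
q_vals = [1.86, 1.90, 0.40, 1.10]
exact = max(q_vals)

for eps in (1.0, 0.1, 0.01):
    approx = smoothed_max(q_vals, eps)
    print(eps, approx, approx - exact)  # the gap shrinks as eps decreases
```

With $\epsilon = 1$ the expression reduces to the ordinary log-sum-exp, that is, to pure sum-inference over all variables.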



4. Variational Approximations for Marginal MAP 

Theorem 2 transforms the marginal MAP problem into a variational form, but obviously 
does not decrease its computational hardness. Fortunately, many well-established varia- 
tional techniques for sum- and max- inference can be extended to apply to (11), opening 
a new door for deriving novel approximate algorithms for marginal MAP. In the spirit of 
Wainwright and Jordan (2008), one can either relax $\mathbb{M}$ to a simpler outer bound like $\mathbb{L}(G)$
and replace $F_{mix}(\tau, \theta)$ by some tractable form to give algorithms similar to loopy BP or
TRW BP, or restrict $\mathbb{M}$ to a tractable subset like $\mathbb{M}_{mf}$ to give mean-field-like algorithms.
In the sequel, we demonstrate several such approximation schemes, mainly focusing on the
BP-like methods with pairwise free energies. We will briefly discuss mean-field-like methods
when we connect to EM in Section 7, and derive an extension to junction graphs that
exploits higher order approximations in Section 8. Our framework can be easily adapted to
take advantage of other, more advanced variational techniques, like those using higher order
cliques (e.g., Yedidia et al., 2005; Globerson and Jaakkola, 2007; Liu and Ihler, 2011; Hazan
et al., 2012) or more advanced optimization methods like dual decomposition (Sontag et al.,
2011) or alternating direction method of multipliers (Boyd et al., 2010).

We start by characterizing the graph structure on which marginal MAP is tractable. 

Definition 4.1. We call $G$ an A-B tree if there exists a partial order on the node set
$V = A \cup B$, satisfying

1) Tree-order. For any $i \in V$, there is at most one other node $j \in V$ (called its parent),
such that $j \prec i$ and $(ij) \in E$;

2) A-B Consistency. For any $a \in A$ and $b \in B$, we have $b \prec a$.

We call such a partial order an A-B tree-order of $G$.

For further notation, let $G_A = (A, E_A)$ be the subgraph induced by nodes in $A$, i.e.,
$E_A = \{(ij) \in E \mid i \in A, j \in A\}$, and similarly for $G_B = (B, E_B)$. Let
$\partial_{AB} = \{(ij) \in E \mid i \in A, j \in B\}$ be the edges that join sets $A$ and $B$.

Obviously, marginal MAP on an A-B tree can be tractably solved by sequentially elim- 
inating the variables along the A-B tree-order (see e.g., Koller and Friedman, 2009). We 
show that its dual optimization is also tractable in this case. 

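Lemma 4. If $G$ is an A-B tree, then

1) The locally consistent polytope equals the marginal polytope, that is, $\mathbb{M} = \mathbb{L}(G)$.

2) The conditional entropy has a pairwise decomposition,

$$H_{A|B}(\tau) = \sum_{i \in A} H_i - \sum_{(ij) \in E_A \cup \partial_{AB}} I_{ij}. \tag{14}$$

Proof. 1) The fact that $\mathbb{M} = \mathbb{L}(G)$ on trees is a standard result; see Wainwright and Jordan
(2008) for details.

2) Because $G$ is an A-B tree, both $p(x)$ and $p(x_B)$ have tree structured conditional
dependency. We then have (see e.g., Wainwright and Jordan, 2008) that

$$H(\tau) = \sum_{i \in V} H_i - \sum_{(ij) \in E} I_{ij} \quad \text{and} \quad H_B(\tau) = \sum_{i \in B} H_i - \sum_{(ij) \in E_B} I_{ij}.$$

Equation (14) follows by using the entropic chain rule $H_{A|B}(\tau) = H(\tau) - H_B(\tau)$, since
$V \setminus B = A$ and $E \setminus E_B = E_A \cup \partial_{AB}$. □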
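The sequential elimination along an A-B tree-order mentioned before Lemma 4 can be sketched as a single leaf-to-root pass in which each sum node is eliminated by summation and each max node by maximization. The code below is a minimal illustration under an assumed reading of Definition 4.1: the graph is a rooted tree, nodes are numbered so that `parent[i] < i`, and no max node has a sum-node ancestor. The function names, the data layout and the factor values are hypothetical, and the result is compared against brute-force enumeration of $\max_{x_B} \sum_{x_A} \prod \psi$.

```python
import itertools

def eliminate(parent, is_sum, psi_node, psi_edge):
    """Leaf-to-root elimination; returns max_{x_B} sum_{x_A} of the factor product.

    parent[i] < i is the parent of node i (None for the root, node 0).
    is_sum[i] is True for sum nodes (A) and False for max nodes (B).
    psi_edge[i][x_i][x_parent] is the factor on edge (i, parent[i]).
    """
    belief = [list(p) for p in psi_node]
    for i in range(len(parent) - 1, 0, -1):   # children before parents
        op = sum if is_sum[i] else max
        j = parent[i]
        for xj in range(len(belief[j])):
            belief[j][xj] *= op(belief[i][xi] * psi_edge[i][xi][xj]
                                for xi in range(len(belief[i])))
    return (sum if is_sum[0] else max)(belief[0])

def brute_force(parent, is_sum, psi_node, psi_edge):
    """Direct evaluation of max over x_B of the sum over x_A."""
    n = len(parent)
    a_nodes = [i for i in range(n) if is_sum[i]]
    b_nodes = [i for i in range(n) if not is_sum[i]]

    def weight(x):
        w = 1.0
        for i in range(n):
            w *= psi_node[i][x[i]]
            if parent[i] is not None:
                w *= psi_edge[i][x[i]][x[parent[i]]]
        return w

    best = 0.0
    for xb in itertools.product(*(range(len(psi_node[i])) for i in b_nodes)):
        total = 0.0
        for xa in itertools.product(*(range(len(psi_node[i])) for i in a_nodes)):
            x = [0] * n
            for i, v in zip(b_nodes, xb):
                x[i] = v
            for i, v in zip(a_nodes, xa):
                x[i] = v
            total += weight(x)
        best = max(best, total)
    return best

# Hypothetical tree: max nodes 0 and 1 on top, sum nodes 2, 3, 4 below them.
parent = [None, 0, 1, 2, 0]
is_sum = [False, False, True, True, True]
psi_node = [[1.0, 2.0], [0.5, 1.5], [1.0, 1.0], [2.0, 0.3], [0.7, 1.1]]
table = [[2.0, 0.5], [0.5, 2.0]]
psi_edge = [None, table, table, table, table]

print(eliminate(parent, is_sum, psi_node, psi_edge),
      brute_force(parent, is_sum, psi_node, psi_edge))
```

The single pass is valid here because every subtree of sum nodes hangs below a max node, so the sums factorize given $x_B$. The tree in Fig. 1, where max nodes hang off a chain of sum nodes, does not satisfy this assumption.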
4.1 Bethe-like Free Energy

Lemma 4 suggests that the free energy of A-B trees can be decomposed into singleton and
pairwise terms that are easy to deal with. This is not true for general graphs, but motivates
a "Bethe" like approximation,

$$\Phi_{bethe}(\theta) = \max_{\tau \in \mathbb{L}(G)} F_{bethe}(\tau, \theta), \qquad F_{bethe}(\tau, \theta) = \langle \theta, \tau \rangle + \sum_{i \in A} H_i - \sum_{(ij) \in E_A \cup \partial_{AB}} I_{ij}, \tag{15}$$

where we call $F_{bethe}(\tau, \theta)$ a "truncated" Bethe free energy, since its entropy and mutual
information terms that involve only max nodes are truncated. If $G$ is an A-B tree, $\Phi_{bethe}$
equals the true $\Phi_{AB}$, giving an intuitive justification. In the sequel we give more general
theoretical conditions under which this approximation gives the exact solution, and we find 
empirically that it usually gives surprisingly good solutions in practice. Similar to the 
regular Bethe approximation, (15) leads to a nonconvex optimization, and we will derive 
both message passing algorithms and provably convergent algorithms to solve it. 

4.2 Tree-reweighted Free Energy 

Following the idea of TRW belief propagation (Wainwright et al., 2005a), we construct an
approximation of marginal MAP using a convex combination of A-B subtrees (subgraphs
of $G$ that are A-B trees). Let $\mathcal{T}_{AB}$ be a collection of A-B subtrees of $G$. We assign to
each $T \in \mathcal{T}_{AB}$ a weight $w_T$ satisfying $w_T \ge 0$ and $\sum_{T \in \mathcal{T}_{AB}} w_T = 1$. For each A-B subtree
$T = (V, E_T)$, define

$$H_{A|B}(\tau; T) = \sum_{i \in A} H_i - \sum_{(ij) \in E_T \setminus E_B} I_{ij}.$$

As shown in Wainwright and Jordan (2008), $H_{A|B}(\tau; T)$ is always a concave function
of $\tau$ on $\mathbb{L}(G)$, and $H_{A|B}(\tau) \le H_{A|B}(\tau; T)$ for all $\tau \in \mathbb{M}$ and $T \in \mathcal{T}_{AB}$. More generally, we
have $H_{A|B}(\tau) \le \sum_{T \in \mathcal{T}_{AB}} w_T H_{A|B}(\tau; T)$, which can be transformed to

$$H_{A|B}(\tau) \le \sum_{i \in A} H_i - \sum_{(ij) \in E_A \cup \partial_{AB}} \rho_{ij} I_{ij}, \tag{16}$$

where $\rho_{ij} = \sum_{T : (ij) \in E_T} w_T$ are the edge appearance probabilities as defined in Wainwright
and Jordan (2008). Replacing $\mathbb{M}$ with $\mathbb{L}(G)$ and $H_{A|B}(\tau)$ with the bound in (16) leads to
a TRW-like approximation of marginal MAP,

$$\Phi_{trw}(\theta) = \max_{\tau \in \mathbb{L}(G)} F_{trw}(\tau, \theta), \qquad F_{trw}(\tau, \theta) = \langle \theta, \tau \rangle + \sum_{i \in A} H_i - \sum_{(ij) \in E_A \cup \partial_{AB}} \rho_{ij} I_{ij}. \tag{17}$$

Since $\mathbb{L}(G)$ is an outer bound of $\mathbb{M}$, and $F_{trw}$ is a concave upper bound of the true free energy,
we can guarantee that $\Phi_{trw}(\theta)$ is always an upper bound of $\Phi_{AB}(\theta)$. To our knowledge,
this provides the first known convex relaxation for upper bounding marginal MAP. One can
also optimize the weights $\{w_T : T \in \mathcal{T}_{AB}\}$ to get the tightest upper bound using methods
similar to those used for regular TRW BP (see Wainwright et al., 2005a).
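
The edge appearance probabilities follow directly from the subtree weights. The sketch below is illustrative only: the function name and the example graph (a triangle with sum node 1 and max nodes 2, 3, covered by two A-B subtrees of equal weight) are assumptions, not taken from the paper.

```python
def edge_appearance(trees, weights):
    """rho_ij = sum of w_T over the subtrees T whose edge set contains (ij)."""
    rho = {}
    for edges, w in zip(trees, weights):
        for e in edges:
            key = tuple(sorted(e))  # undirected edge, canonical orientation
            rho[key] = rho.get(key, 0.0) + w
    return rho

# Hypothetical triangle graph with A = {1}, B = {2, 3}. Each A-B subtree keeps
# the max edge (2, 3) and exactly one crossing edge.
trees = [
    [(2, 3), (1, 2)],
    [(2, 3), (1, 3)],
]
weights = [0.5, 0.5]
rho = edge_appearance(trees, weights)
```

In this example the crossing-edge weights sum to one, since each subtree with a connected max part can include only one crossing edge.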
4.3 Global Optimality Guarantees

We show the global optimality guarantees of the above approximations under some circum- 
stances. In this section, we always assume Ga is a tree, and hence the objective function 
is tractable to calculate for a given xb- However, the optimization component remains 
intractable in this case, because the marginalization step destroys the decomposition struc- 
ture of the objective function (see Fig. 1). It is thus nontrivial to see how the Bethe and 
TRW approximations behave in this case. 

In general, suppose we approximate $\Phi_{AB}(\theta)$ using the following pairwise approximation,

$$\Phi_{tree}(\theta) = \max_{\tau \in \mathbb{L}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{i \in A} H_i - \sum_{(ij) \in E_A} I_{ij} - \sum_{(ij) \in \partial_{AB}} \rho_{ij} I_{ij} \Big\}, \tag{18}$$

where the weights on the sum part, $\{\rho_{ij} : (ij) \in E_A\}$, have been fixed to be ones. This
choice makes sure that the sum part is "intact" in the approximation, while the weights
on the crossing edges, $\rho_{AB} = \{\rho_{ij} : (ij) \in \partial_{AB}\}$, can take arbitrary values, corresponding
to different free energy approximation methods. If $\rho_{ij} = 1$ for $\forall (ij) \in \partial_{AB}$, it is the Bethe
free energy; it will correspond to the TRW free energy if $\{\rho_{ij}\}$ are taken to be a set of edge
appearance probabilities (which in general have values less than one). The edge appearance
probabilities of A-B trees are more restrictive than for the standard trees used in TRW
BP. For example, if the max part of an A-B subtree is a connected tree, then it can include
at most one crossing edge, so in this case $\rho_{AB}$ should satisfy $\sum_{(ij) \in \partial_{AB}} \rho_{ij} = 1$.
Interestingly, we will show in Section 7 that if $\rho_{ij} \to +\infty$ for $\forall (ij) \in \partial_{AB}$, then Equation (18)
is closely related to an EM algorithm.

Theorem 5. Suppose the sum part $G_A$ is a tree, and we approximate $\Phi_{AB}(\theta)$ using $\Phi_{tree}(\theta)$
defined in (18). Assume that (18) is globally optimized.

(i) We have $\Phi_{tree}(\theta) \ge \Phi_{AB}(\theta)$. If there exists $x_B^*$ such that $Q(x_B^*; \theta) = \Phi_{tree}(\theta)$,
we have $\Phi_{tree}(\theta) = \Phi_{AB}(\theta)$, and $x_B^*$ is a globally optimal marginal MAP solution.

(ii) Suppose $\tau^*$ is a global maximum of (18), and $\{\tau_i^*(x_i) \mid i \in B\}$ have integral values,
i.e., $\tau_i^*(x_i) = 0$ or $1$, then $\{x_i^* = \arg\max_{x_i} \tau_i^*(x_i) : i \in B\}$ is a globally optimal solution
of the marginal MAP problem (10).

Proof (sketch). (See appendix for the complete proof.) The fact that the sum part $G_A$ is a
tree guarantees the marginalization is exact. Showing (18) is a relaxation of the maximization
problem and applying standard relaxation arguments completes the proof. □

Remark. Theorem 5 works for arbitrary values of $\rho_{AB}$, and suggests a fundamental
tradeoff of hardness as $\rho_{AB}$ takes on different values. On the one hand, the value of
$\rho_{AB}$ controls the concavity of the objective function in (18) and hence the difficulty of
finding a global optimum; small enough $\rho_{AB}$ (as in TRW) can ensure that (18) is a convex
optimization, while larger $\rho_{AB}$ (as in Bethe or EM) causes (18) to become non-convex,
making it difficult to apply Theorem 5. On the other hand, the value of $\rho_{AB}$ also controls
how likely the solution is to be integral - larger $\rho_{ij}$ emphasizes the mutual information terms,
forcing the solution towards integral points. Thus the solution of the TRW free energy is less
likely to be integral than the Bethe free energy, causing a difficulty in applying Theorem 5
to TRW solutions as well. The TRW approximation ($\sum_{(ij) \in \partial_{AB}} \rho_{ij} = 1$) and EM ($\rho_{ij} \to +\infty$; see
Section 7) reflect two extrema of this tradeoff between concavity and integrality, respectively,
while the Bethe approximation ($\rho_{ij} = 1$) appears to represent a reasonable compromise that
often gives excellent performance in practice. In Section 5.2, we give a different set of local
optimality guarantees that are derived from a reparameterization perspective.



5. Message Passing Algorithms for Marginal MAP 

We now derive message-passing-style algorithms to optimize the "truncated" Bethe or TRW 
free energies in (15) and (17). Instead of optimizing the truncated free energies directly, we 
leverage the results of Theorem 3 and consider their "annealed" versions, 

$$\max_{\tau \in \mathbb{L}(G)} \big\{ \langle \theta, \tau \rangle + \hat{H}_{A|B}(\tau) + \epsilon \hat{H}_B(\tau) \big\},$$

where $\epsilon$ is a positive annealing coefficient (or temperature), and $\hat{H}_{A|B}(\tau)$ and $\hat{H}_B(\tau)$
are the generic pairwise approximations of $H_{A|B}(\tau)$ and $H_B(\tau)$, respectively. That is,

$$\hat{H}_{A|B}(\tau) = \sum_{i \in A} H_i - \sum_{(ij) \in E_A \cup \partial_{AB}} \rho_{ij} I_{ij}, \quad \text{and} \quad \hat{H}_B(\tau) = \sum_{i \in B} H_i - \sum_{(ij) \in E_B} \rho_{ij} I_{ij}, \tag{19}$$

where different values of pairwise weights $\{\rho_{ij}\}$ correspond to either the Bethe approximation
or the TRW approximation. This yields a generic pairwise free energy optimization
problem,

$$\max_{\tau \in \mathbb{L}(G)} \Big\{ \langle \theta, \tau \rangle + \sum_{i \in V} w_i H_i - \sum_{(ij) \in E} w_{ij} I_{ij} \Big\}, \tag{20}$$

where the weights $\{w_i, w_{ij}\}$ are determined by the temperature $\epsilon$ and $\{\rho_{ij}\}$ via

$$w_i = \begin{cases} 1 & \forall i \in A \\ \epsilon & \forall i \in B \end{cases}, \qquad w_{ij} = \begin{cases} \rho_{ij} & \forall (ij) \in E_A \cup \partial_{AB} \\ \epsilon \rho_{ij} & \forall (ij) \in E_B. \end{cases} \tag{21}$$

The general framework in (20) provides a unified treatment for approximating sum-inference,
max-inference and mixed, marginal MAP problems simply by taking different weights.
Specifically,

1. If $w_i = 1$ for all $i \in V$, (20) corresponds to the sum-inference problem and the
sum-product BP objectives and algorithms.

2. If $w_i = 0$ for all $i \in V$, (20) corresponds to the max-inference problem and the
max-product linear programming objective and algorithms.

3. If $w_i = 1$ for $\forall i \in A$ and $w_i = 0$ for $\forall i \in B$, (20) corresponds to the marginal MAP
problem; in the sequel, we derive "mixed-product" BP algorithms.

Note the different roles of the singleton and pairwise weights: the singleton weights $\{w_i : i \in V\}$
define the type of inference problem, while the pairwise weights $\{w_{ij} : (ij) \in E\}$ determine
the approximation method (e.g., Bethe vs. TRW).
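
The mapping from the temperature and pairwise weights to the weights of the generic problem is simple bookkeeping. The following sketch assumes hypothetical container types (dictionaries keyed by node and by edge tuple) and an illustrative three-node example; none of these names come from the paper.

```python
def annealed_weights(eps, rho, sum_nodes, max_nodes):
    """Singleton and pairwise weights for the generic pairwise free energy.

    Sum nodes get weight 1 and max nodes get weight eps; an edge keeps rho_ij
    unless both endpoints are max nodes, in which case it gets eps * rho_ij.
    """
    w_node = {i: 1.0 for i in sum_nodes}
    w_node.update({i: eps for i in max_nodes})
    w_edge = {}
    for (i, j), r in rho.items():
        both_max = i in max_nodes and j in max_nodes
        w_edge[(i, j)] = eps * r if both_max else r
    return w_node, w_edge

# Hypothetical example: A = {1}, B = {2, 3}, with TRW-style weights.
rho = {(1, 2): 0.5, (1, 3): 0.5, (2, 3): 1.0}
w_node, w_edge = annealed_weights(0.25, rho, sum_nodes={1}, max_nodes={2, 3})
```

Letting `eps` shrink towards zero moves the max nodes towards pure maximization while the sum nodes keep their full entropy terms.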
Algorithm 1 Annealed BP for Marginal MAP

Define the pairwise weights $\{\rho_{ij} : (ij) \in E\}$, e.g., $\rho_{ij} = 1$ for Bethe or valid appearance
probabilities for TRW. Initialize the messages $\{m_{i \to j} : (ij) \in E\}$.
for iteration t do

1. Update $\epsilon$ by $\epsilon = 1/t$, and correspondingly the weights $\{w_i, w_{ij}\}$ by (21).

2. Perform the message passing update in (22) for all edges $(ij) \in E$.
end for

Calculate the singleton beliefs $b_i(x_i)$ and decode the solution $x_B^*$,

$$x_i^* = \arg\max_{x_i} b_i(x_i), \ \forall i \in B, \quad \text{where } b_i(x_i) \propto \psi_i(x_i) m_{\sim i}(x_i). \tag{24}$$



We now derive a message passing-style algorithm for solving the generic problem (20).
Assuming $w_i$ and $w_{ij}$ are strictly positive, using a Lagrange multiplier method similar to
Yedidia et al. (2005) or Wainwright et al. (2005a), we can show that the stationary points
of (20) satisfy the fixed point condition of the following message passing update.

$$\text{Message Update:} \quad m_{i \to j}(x_j) \leftarrow \Big[ \sum_{x_i} (\psi_i m_{\sim i})^{1/w_i} \Big( \frac{\psi_{ij}}{m_{j \to i}} \Big)^{1/w_{ij}} \Big]^{w_{ij}} \tag{22}$$

$$\text{Marginal Decoding:} \quad \tau_i(x_i) \propto (\psi_i m_{\sim i})^{1/w_i}, \quad \tau_{ij}(x_{ij}) \propto \tau_i \tau_j \Big( \frac{\psi_{ij}}{m_{i \to j} m_{j \to i}} \Big)^{1/w_{ij}} \tag{23}$$

where $m_{\sim i}(x_i)$ is the product of messages sent into node $i$, that is,

$$m_{\sim i}(x_i) = \prod_{k \in \partial i} m_{k \to i}(x_i).$$

The above message update is mostly similar to TRW-BP of Wainwright et al. (2005a), except
that it incorporates general singleton weights $w_i$. The marginal MAP problem can be solved
by running (22) with $\{w_i, w_{ij}\}$ defined by (21) and a scheme for choosing the temperature
$\epsilon$, either directly set to be a small constant, or gradually decreased (or annealed) to zero
through iterations, e.g., by $\epsilon = 1/t$ where $t$ is the iteration. Algorithm 1 describes the
details for the annealing method.
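
To make the weighted message update concrete, the following is a minimal sketch of it on a pairwise model, written under several assumptions: potentials are stored as plain Python lists, messages are normalized after each update (which the proportionality in the beliefs permits), and the function and variable names are hypothetical. The example runs it with all weights equal to one on a three-node chain, where it reduces to ordinary sum-product and gives exact marginals.

```python
def pair(psi_edge, i, j, xi, xj):
    """Look up psi_ij(x_i, x_j) regardless of how the edge is stored."""
    if (i, j) in psi_edge:
        return psi_edge[(i, j)][xi][xj]
    return psi_edge[(j, i)][xj][xi]

def weighted_bp(psi_node, psi_edge, w_node, w_edge, n_iters=50):
    nodes = list(psi_node)
    nbrs = {i: [] for i in nodes}
    for (i, j) in psi_edge:
        nbrs[i].append(j)
        nbrs[j].append(i)
    k = {i: len(psi_node[i]) for i in nodes}
    msg = {(i, j): [1.0] * k[j] for i in nodes for j in nbrs[i]}

    def w(i, j):
        return w_edge[(i, j)] if (i, j) in w_edge else w_edge[(j, i)]

    def belief(i):
        # psi_i(x_i) times the product of all incoming messages (unnormalized)
        out = []
        for xi in range(k[i]):
            v = psi_node[i][xi]
            for n in nbrs[i]:
                v *= msg[(n, i)][xi]
            out.append(v)
        return out

    for _ in range(n_iters):
        for (i, j) in list(msg):
            b = belief(i)
            new = []
            for xj in range(k[j]):
                s = 0.0
                for xi in range(k[i]):
                    ratio = pair(psi_edge, i, j, xi, xj) / msg[(j, i)][xi]
                    s += b[xi] ** (1.0 / w_node[i]) * ratio ** (1.0 / w(i, j))
                new.append(s ** w(i, j))
            z = sum(new)
            msg[(i, j)] = [v / z for v in new]

    tau = {}
    for i in nodes:
        t = [v ** (1.0 / w_node[i]) for v in belief(i)]
        z = sum(t)
        tau[i] = [v / z for v in t]
    return tau

# Hypothetical chain 0 - 1 - 2 with all weights equal to one (sum-inference).
psi_node = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [3.0, 1.0]}
psi_edge = {(0, 1): [[2.0, 1.0], [1.0, 2.0]], (1, 2): [[2.0, 1.0], [1.0, 2.0]]}
tau = weighted_bp(psi_node, psi_edge,
                  w_node={0: 1.0, 1: 1.0, 2: 1.0},
                  w_edge={(0, 1): 1.0, (1, 2): 1.0})
```

For small temperatures the powers $1/w_i$ become large, so a practical implementation would likely work in the log domain; this sketch does not.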

5.1 Mixed-Product Belief Propagation 

Directly taking $\epsilon \to 0^+$ in message update (22), we can get an interesting "mixed-product"
BP algorithm that is a hybrid of the max-product and sum-product message updates, with
a novel "argmax-product" message update that is specific to marginal MAP problems. This
algorithm is listed in Algorithm 2, and described by the following proposition:

Proposition 6. As $\epsilon$ approaches zero from the positive side, that is, $\epsilon \to 0^+$, the message
update (22) reduces to the update in (25)-(27) in Algorithm 2.

Proof. For messages from $i \in A$ to $j \in A \cup B$, we have $w_i = 1$, $w_{ij} = \rho_{ij}$; the result is
obvious.
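Algorithm 2 Mixed-product Belief Propagation for Marginal MAP

Define the pairwise weights $\{\rho_{ij} : (ij) \in E\}$ and initialize messages $\{m_{i \to j} : (ij) \in E\}$ as
in Algorithm 1.
for iteration t do
for edge $(ij) \in E$ do

Perform different message updates depending on the node type of the source and
destination,

$$A \to A \cup B: \quad m_{i \to j} \leftarrow \Big[ \sum_{x_i} (\psi_i m_{\sim i}) \Big( \frac{\psi_{ij}}{m_{j \to i}} \Big)^{1/\rho_{ij}} \Big]^{\rho_{ij}}, \quad \text{(sum-product)} \tag{25}$$

$$B \to B: \quad m_{i \to j} \leftarrow \max_{x_i} \, (\psi_i m_{\sim i})^{\rho_{ij}} \Big( \frac{\psi_{ij}}{m_{j \to i}} \Big), \quad \text{(max-product)} \tag{26}$$

$$B \to A: \quad m_{i \to j} \leftarrow \Big[ \sum_{x_i \in \mathcal{X}_i^*} (\psi_i m_{\sim i}) \Big( \frac{\psi_{ij}}{m_{j \to i}} \Big)^{1/\rho_{ij}} \Big]^{\rho_{ij}}, \quad \text{(argmax-product)} \tag{27}$$

where the set $\mathcal{X}_i^* = \arg\max_{x_i} \psi_i(x_i) m_{\sim i}(x_i)$ and $m_{\sim i}(x_i) = \prod_{k \in \partial i} m_{k \to i}(x_i)$.

end for
end for

Calculate the singleton beliefs $b_i(x_i)$ and decode the solution $x_B^*$,

$$x_i^* = \arg\max_{x_i} b_i(x_i), \ \forall i \in B, \quad \text{where } b_i(x_i) \propto \psi_i(x_i) m_{\sim i}(x_i). \tag{28}$$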



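For messages from $i \in B$ to $j \in B$, we have $w_i = \epsilon$, $w_{ij} = \epsilon \rho_{ij}$. The result follows from the
zero temperature limit formula in (13), by letting $f(x_i) = (\psi_i m_{\sim i})^{\rho_{ij}} (\psi_{ij} / m_{j \to i})$.
For messages from $i \in B$ to $j \in A$, we have $w_i = \epsilon$, $w_{ij} = \rho_{ij}$. Let $\mathcal{X}_i^* = \arg\max_{x_i} \psi_i m_{\sim i}$
be the set of maximizing arguments of the belief $b_i$, and $C_i = \max_{x_i} \psi_i m_{\sim i}$ its maximum
value; one can show that

$$\lim_{\epsilon \to 0^+} \Big( \frac{\psi_i m_{\sim i}}{C_i} \Big)^{1/\epsilon} = 1(x_i \in \mathcal{X}_i^*).$$

Plugging this into (22) and dropping the constant $C_i$, we get the message update in (27). □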
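
The two limits used in this proof are easy to observe numerically. The sketch below is illustrative (the helper names and the example vector are assumptions): `soft_max_power` evaluates $[\sum_x f(x)^{1/\epsilon}]^{\epsilon}$, which approaches $\max_x f(x)$ as $\epsilon \to 0^+$, and `soft_indicator` evaluates $(f(x)/C)^{1/\epsilon}$ with $C = \max_x f(x)$, which approaches the indicator of the maximizing set.

```python
def soft_max_power(f, eps):
    """[sum_x f(x)^(1/eps)]^eps, computed stably by factoring out max f."""
    c = max(f)
    return c * sum((v / c) ** (1.0 / eps) for v in f) ** eps

def soft_indicator(f, eps):
    """(f(x) / max f)^(1/eps): tends to 1 on the argmax set and to 0 elsewhere."""
    c = max(f)
    return [(v / c) ** (1.0 / eps) for v in f]

# Hypothetical positive function on three states.
f = [0.2, 0.9, 0.5]
at_one = soft_max_power(f, 1.0)      # plain sum when eps = 1
near_zero = soft_max_power(f, 0.01)  # close to max(f) for small eps
indicator = soft_indicator(f, 0.01)
```

With $\epsilon = 1$ the first quantity is the ordinary sum, which is why the same update interpolates between sum-product and max-product behavior.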

Algorithm 2 has an intuitive interpretation: the sum-product and max-product messages
in (25) and (26) correspond to the marginalization and maximization steps, respectively.
The special "argmax-product" messages in (27) serve to synchronize the sum-product and
max-product messages - they restrict the max nodes to the currently decoded local marginal
MAP solutions $\mathcal{X}_i^* = \arg\max_{x_i} \psi_i(x_i) m_{\sim i}(x_i)$, and pass the posterior beliefs back to the
sum part. Note that the summation notation in (27) can be ignored if $\mathcal{X}_i^*$ has only a single
optimal state.

One critical feature of our mixed-product BP is that it takes simultaneous movements
on the marginalization and maximization sub-problems in a parallel fashion, and is
computationally much more efficient than the traditional methods that require fully solving a
marginalization sub-problem before taking each maximization step. This advantage is inherited
from our general variational framework, which naturally integrates the marginalization
and maximization sub-problems into a joint optimization problem.

Interestingly, Algorithm 2 also bears similarity to a recent hybrid message passing 
method of Jiang et al. (2011), which differs from Algorithm 2 only in replacing the special 
argmax-product messages (27) with regular max-product messages. We make a detailed 
comparison of these two algorithms in Section 5.3, and show that it is in fact the
argmax-product messages (27) that lend our algorithm several appealing optimality guarantees.

5.2 Reparameterization Interpretation and Local Optimality Guarantees 

An important interpretation of the sum-product and max-product BP is the reparameteri- 
zation viewpoint (Wainwright et al., 2003; Weiss et al., 2007): Message passing updates can 
be viewed as moving probability mass between local pseudo-marginals (or beliefs), in a way 
that leaves their product a reparameterization of the original distribution, while ensuring 
some consistency conditions at the fixed points. Such viewpoints are theoretically impor- 
tant, because they are useful for proving optimality guarantees for the BP algorithms. In 
this section, we show that the mixed-product BP in Algorithm 2 has a similar reparam- 
eterization interpretation, based on which we establish a local optimality guarantee for 
mixed-product BP. 

To start, we define a set of "mixed-beliefs" as 

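$$b_i(x_i) \propto \psi_i m_{\sim i}, \qquad b_{ij}(x_{ij}) \propto b_i b_j \Big( \frac{\psi_{ij}}{m_{i \to j} m_{j \to i}} \Big)^{1/\rho_{ij}}. \tag{29}$$

The marginal MAP solution should be decoded from $x_i^* \in \arg\max_{x_i} b_i(x_i), \forall i \in B$, as is
typical in max-product BP. Note that the above mixed-beliefs $\{b_i, b_{ij}\}$ are different from
the local marginals $\{\tau_i, \tau_{ij}\}$ defined in (23), but are rather softened versions of $\{\tau_i, \tau_{ij}\}$. Their
relationship is explicitly clarified in the following.

Proposition 7. The $\{\tau_i, \tau_{ij}\}$ in (23) and the $\{b_i, b_{ij}\}$ in (29) are associated via

$$b_i \propto \tau_i \ \ \forall i \in A, \qquad b_{ij} \propto b_i b_j \Big( \frac{\tau_{ij}}{\tau_i \tau_j} \Big) \ \ \forall (ij) \in E_A \cup \partial_{AB},$$

$$b_i \propto (\tau_i)^{\epsilon} \ \ \forall i \in B, \qquad b_{ij} \propto b_i b_j \Big( \frac{\tau_{ij}}{\tau_i \tau_j} \Big)^{\epsilon} \ \ \forall (ij) \in E_B.$$

Proof. The result follows from the simple algebraic transformation between (23) and (29).

□

Therefore, as $\epsilon \to 0^+$, the $\tau_i$ ($= b_i^{1/\epsilon}$) for $i \in B$ should concentrate their mass on a
deterministic configuration, but $b_i$ may continue to have soft values.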

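We now show that the mixed-beliefs $\{b_i, b_{ij}\}$ have a reparameterization interpretation.

Theorem 8. At the fixed point of mixed-product BP in Algorithm 2, the mixed-beliefs
defined in (29) satisfy

Reparameterization:

$$p(x) \propto \prod_{i \in V} b_i(x_i) \prod_{(ij) \in E} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}}. \tag{30}$$

Mixed-consistency:

$$\text{(a)} \quad \sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \quad \forall i \in A, j \in A \cup B, \tag{31}$$

$$\text{(b)} \quad \max_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \quad \forall i \in B, j \in B, \tag{32}$$

$$\text{(c)} \quad \sum_{x_i \in \arg\max b_i} b_{ij}(x_i, x_j) = b_j(x_j), \quad \forall i \in B, j \in A. \tag{33}$$

Proof. Directly substitute the definition (29) into the message update (25)-(27). □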

The three mixed-consistency constraints exactly map to the three types of message updates
in Algorithm 2. Constraints (a) and (b) enforce the regular sum- and max-consistency
of the sum- and max-product messages in (25) and (26), respectively. Constraint (c)
corresponds to the argmax-product message update in (27): it enforces the
marginals to be consistent after $x_i$ is assigned to the currently decoded solution,
$x_i = \arg\max_{x_i} b_i(x_i) = \arg\max_{x_i} \sum_{x_j} b_{ij}(x_i, x_j)$, corresponding to solving a local marginal MAP
problem on $b_{ij}(x_i, x_j)$. It turns out that this special constraint is a crucial ingredient of
mixed-product BP, enabling us to prove guarantees on the strong local optimality of the
solution.

Some notation is required. Suppose $C$ is a subset of max nodes in $B$. Let $G_{C \cup A} =
(C \cup A, E_{C \cup A})$ be the subgraph of $G$ induced by nodes $C \cup A$, where $E_{C \cup A} = \{(ij) \in
E : i, j \in C \cup A\}$. We call $G_{C \cup A}$ a semi-A-B subtree of $G$ if the edges in $E_{C \cup A} \setminus E_B$ form an
A-B tree. In other words, $G_{C \cup A}$ is a semi-A-B tree if it is an A-B tree when ignoring any
edges entirely within the max set $B$. See Fig. 2 for examples of semi-A-B trees.

Following Weiss et al. (2007), we say that a set of weights $\{\rho_{ij}\}$ is provably convex if there
exist positive constants $\kappa_i$ and $\kappa_{i \to j}$, such that $\kappa_i + \sum_{i' \in \partial i} \kappa_{i' \to i} = 1$ and $\kappa_{i \to j} + \kappa_{j \to i} = \rho_{ij}$.
Weiss et al. (2007) show that if $\{\rho_{ij}\}$ is provably convex, then $H(\tau) = \sum_i H_i - \sum_{ij} \rho_{ij} I_{ij}$
is a concave function of $\tau$ in the locally consistent polytope $\mathbb{L}(G)$.

Theorem 9. Suppose $C$ is a subset of $B$ such that $G_{C \cup A}$ is a semi-A-B tree, and the
weights $\{\rho_{ij}\}$ satisfy

1. $\rho_{ij} = 1$ for $(ij) \in E_A$;

2. $0 \le \rho_{ij} \le 1$ for $(ij) \in E_{C \cup A} \cap \partial_{AB}$;

3. $\{\rho_{ij} \mid (ij) \in E_{C \cup A} \cap E_B\}$ is provably convex.

At the fixed point of mixed-product BP in Algorithm 2, if the mixed-beliefs $b_i$, $b_{ij}$ in (29) all
have unique maxima, then there exists a $B$-configuration $x_B^*$ satisfying $x_i^* = \arg\max b_i$ for
$\forall i \in B$ and $(x_i^*, x_j^*) = \arg\max b_{ij}$ for $\forall (ij) \in E_B$, and $x_B^*$ is locally optimal in the sense
that $Q(x_B^*; \theta)$ is not smaller than any $B$-configuration that differs from $x_B^*$ only on $C$, that
is, $Q(x_B^*; \theta) = \max_{x_C} Q([x_C, x_{B \setminus C}^*]; \theta)$.

Proof (sketch). (See appendix for the complete proof.) The mixed-consistency constraint
(c) in (33) and the fact that $G_{C \cup A}$ is a semi-A-B tree enable the summation part to be
eliminated. The remaining part only involves the max nodes, and the method in Weiss
et al. (2007) for analyzing standard MAP can be applied. □
Figure 2: Examples of semi-A-B trees. The shaded nodes represent sum nodes, while the
unshaded are max nodes. In each graph, a semi-A-B tree is labeled by red
bold lines. Theorem 9 shows that the fixed point of mixed-product BP is locally
optimal up to jointly perturbing all the max nodes in any semi-A-B subtree of $G$.



For $G_{C \cup A}$ to be a semi-A-B tree, the sum part $G_A$ must be a tree, which Theorem 9
assumes implicitly. For the hidden Markov chain in Fig. 1, Theorem 9 implies only the local
optimality up to Hamming distance one (or coordinate-wise optimality), because any semi
A-B subtree of G in Fig. 1 can contain at most one max node. However, Theorem 9 is in 
general much stronger, especially when the sum part is not fully connected, or when the max 
part has interior regions disconnected from the sum part. As examples, see Fig. 2(b)-(c). 

5.3 The importance of the Argmax-product Message Updates 

Jiang et al. (2011) proposed a similar hybrid message passing algorithm, repeated here as 
Algorithm 3, which differs from our mixed-product BP only in replacing our argmax-product 
message update (27) with the usual max-product message update (26).

Algorithm 3 Hybrid Message Passing by Jiang et al. (2011)

1. Message Update:

$$A \to A \cup B: \quad m_{i \to j} \leftarrow \Big[ \sum_{x_i} (\psi_i m_{\sim i}) \Big( \frac{\psi_{ij}}{m_{j \to i}} \Big)^{1/\rho_{ij}} \Big]^{\rho_{ij}}, \quad \text{(sum-product)}$$

$$B \to A \cup B: \quad m_{i \to j} \leftarrow \max_{x_i} \, (\psi_i m_{\sim i})^{\rho_{ij}} \Big( \frac{\psi_{ij}}{m_{j \to i}} \Big). \quad \text{(max-product)}$$

2. Decoding: $x_i^* = \arg\max_{x_i} b_i(x_i)$ for $\forall i \in B$, where $b_i(x_i) \propto \psi_i(x_i) m_{\sim i}(x_i)$.

Similar to our mixed-product BP, Algorithm 3 also satisfies the reparameterization property
in (30) (with beliefs $\{b_i, b_{ij}\}$ defined by (29)); it also satisfies a set of similar, but
crucially different, consistency conditions at its fixed points,

$$\sum_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \quad \forall i \in A, j \in A \cup B,$$

$$\max_{x_i} b_{ij}(x_i, x_j) = b_j(x_j), \quad \forall i \in B, j \in A \cup B,$$

which exactly map to the max- and sum-product message updates in Algorithm 3.

Despite its striking similarity, Algorithm 3 has very different properties, and does not
share the appealing variational interpretation and optimality guarantees that we have
demonstrated for mixed-product BP. First, it is unclear whether Algorithm 3 can be
interpreted as a fixed point algorithm for maximizing our, or a similar, variational objective
function. Second, it does not inherit the same optimality guarantees in Theorem 9,
despite its similar reparameterization and consistency conditions. These disadvantages are
caused by the loss of the special argmax-product message update and its associated
mixed-consistency condition in (33), which was a critical ingredient of the proof of Theorem 9.

More detailed insights into Algorithm 3 and mixed-product BP can be obtained by con- 
sidering the special case when the full graph G is an undirected tree. We show that in this 
case, Algorithm 3 can be viewed as optimizing a set of approximate objective functions, 
obtained by rearranging the max and sum operators into orders that require less computa- 
tional cost, while mixed-product BP attempts to maximize the exact objective function by 
message updates that effectively perform some "asynchronous" coordinate descent steps. 
In the sequel, we use an illustrative toy example to explain the main ideas. 

Example 2. Consider the marginal MAP problem on the chain $x_3 - x_1 - x_2 - x_4$,
where the graph $G$ is an undirected tree; the sum and max sets are
$A = \{1, 2\}$ and $B = \{3, 4\}$, respectively. We analyze how Algorithm 3
and mixed-product BP in Algorithm 2 perform on this toy example, when
both take Bethe weights ($\rho_{ij} = 1$ for $(ij) \in E$).

Algorithm 3 (Jiang et al. (2011)). Since $G$ is a tree, one can show that Algorithm 3
(with Bethe weights) terminates after a full forward and backward iteration (e.g., messages
passed along $x_3 \to x_1 \to x_2 \to x_4$ and then $x_4 \to x_2 \to x_1 \to x_3$). By tracking the messages,
one can write its final decoded solution in a closed form,

$$x_3^* = \arg\max_{x_3} \sum_{x_1} \sum_{x_2} \max_{x_4} [\exp(\theta(x))], \qquad x_4^* = \arg\max_{x_4} \sum_{x_2} \sum_{x_1} \max_{x_3} [\exp(\theta(x))].$$

On the other hand, the true marginal MAP solution is given by

$$x_3^* = \arg\max_{x_3} \max_{x_4} \sum_{x_1} \sum_{x_2} [\exp(\theta(x))], \qquad x_4^* = \arg\max_{x_4} \max_{x_3} \sum_{x_2} \sum_{x_1} [\exp(\theta(x))].$$


Here, Algorithm 3 approximates the exact marginal MAP problem by rearranging the max
and sum operators into an elimination order that makes the calculation easier. A similar
property holds for the general case when $G$ is an undirected tree: Algorithm 3 (with Bethe
weights) terminates in a finite number of steps, and its output solution $x_i^*$ effectively
maximizes an approximate objective obtained by reordering the max and sum operators along a
tree-order (see Definition 4.1) that is rooted at node $i$. The performance of the algorithm
should be related to the error caused by exchanging the order of max and sum operators.
However, exact optimality guarantees are likely difficult to show because it maximizes an
inexact objective function. In addition, since each component $x_i^*$ uses a different order of
arrangement, and hence maximizes a different surrogate objective function, it is unclear
whether the joint $B$-configuration $x_B^* = \{x_i^* : i \in B\}$ given by Algorithm 3 maximizes a
single consistent objective function.

Algorithm 2 (mixed-product). On the other hand, the mixed-product belief propagation
in Algorithm 2 may not terminate in a finite number of steps, nor does it necessarily
yield a closed form solution when $G$ is an undirected tree. However, Algorithm 2 proceeds
in an attempt to optimize the exact objective function. In this toy example, we can show
that the true solution is guaranteed to be a fixed point of Algorithm 2. Let $b_3(x_3)$ be the
mixed-belief on $x_3$ at the current iteration, and $x_3^* = \arg\max_{x_3} b_3(x_3)$ its unique maximum.
After a message sequence passed from $x_3$ to $x_4$, one can show that $b_4(x_4)$ and $x_4^*$ update to

$$x_4^* = \arg\max_{x_4} b_4(x_4), \qquad b_4(x_4) = \sum_{x_2} \sum_{x_1} \exp(\theta([x_3^*, x_{\neg 3}])) = \exp(Q([x_3^*, x_4]; \theta)),$$

where we maximize the exact objective function $Q([x_3, x_4]; \theta)$ with fixed $x_3 = x_3^*$. Therefore,
on this toy example, one sweep ($x_3 \to x_4$ or $x_4 \to x_3$) of Algorithm 2 is effectively performing
a coordinate descent step, which monotonically improves the true objective function towards
a local maximum. In more general models, Algorithm 2 differs from sequential coordinate
descent, and does not guarantee monotonic convergence. But, it can be viewed as a "parallel"
version of coordinate descent, which ensures the stronger local optimality guarantees shown
in Theorem 9.
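
The effect of reordering the max and sum operators in Example 2 can be explored by enumeration. The sketch below assumes a binary chain $x_3 - x_1 - x_2 - x_4$ with made-up log-potentials; the function names and numeric values are hypothetical. It evaluates, for each $x_3$, the exact objective $\max_{x_4} \sum_{x_1} \sum_{x_2} \exp(\theta(x))$ and the reordered surrogate $\sum_{x_1} \sum_{x_2} \max_{x_4} \exp(\theta(x))$.

```python
import math

STATES = (0, 1)

def theta(x1, x2, x3, x4):
    """Hypothetical pairwise log-potentials on the chain x3 - x1 - x2 - x4."""
    return (0.8 * (x3 == x1) + 1.2 * (x1 != x2)
            + 0.5 * (x2 == x4) + 0.3 * x4 - 0.2 * x3)

def exact_objective(x3, x4):
    """exp(Q([x3, x4]; theta)): sum out the sum nodes x1 and x2."""
    return sum(math.exp(theta(x1, x2, x3, x4)) for x1 in STATES for x2 in STATES)

def true_objective_x3(x3):
    """max over x4 of the exact objective (max outside the sums)."""
    return max(exact_objective(x3, x4) for x4 in STATES)

def reordered_objective_x3(x3):
    """Surrogate with the max over x4 moved inside the sums over x1 and x2."""
    return sum(max(math.exp(theta(x1, x2, x3, x4)) for x4 in STATES)
               for x1 in STATES for x2 in STATES)

x3_true = max(STATES, key=true_objective_x3)
x3_reordered = max(STATES, key=reordered_objective_x3)
```

The surrogate always upper-bounds the exact objective, since a sum of maxima is at least the maximum of sums; whether the two decoded values of $x_3$ agree depends on the potentials.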



6. Convergent Algorithms by Proximal Point Methods 

An obvious disadvantage of mixed-product BP is its lack of convergence guarantees, even 
when $G$ is an undirected tree. In this section, we apply a proximal point approach (e.g.,
Martinet, 1970; Rockafellar, 1976) to derive convergent algorithms that directly optimize
our free energy objectives. Similar methods have been applied to standard sum-inference 
(Yuille, 2002) and max-inference (Ravikumar et al., 2010). 

For the purpose of illustration, we first consider the problem of maximizing the exact
marginal MAP free energy, $F_{mix}(\tau, \theta) = \langle \theta, \tau \rangle + H_{A|B}(\tau)$. The proximal point algorithm
works by iteratively optimizing a smoothed problem,

$$\tau^{t+1} = \arg\min_{\tau \in \mathbb{M}} \{ -F_{mix}(\tau, \theta) + \lambda^t D(\tau \| \tau^t) \},$$

where $\tau^t$ is the solution at iteration $t$, and $\lambda^t$ is a positive coefficient. Here, $D(\cdot \| \cdot)$ is a
distance, called the proximal function, which forces $\tau^{t+1}$ to be close to $\tau^t$; typical choices
of $D(\cdot \| \cdot)$ are Euclidean or Bregman distances or $\psi$-divergences (e.g., Teboulle, 1992; Iusem
and Teboulle, 1993). Proximal algorithms have nice convergence guarantees: the objective
series $\{f(\tau^t)\}$ is guaranteed to be non-increasing at each iteration, and $\{\tau^t\}$ converges to an
optimal solution, under some regularity conditions. See, e.g., Rockafellar (1976); Tseng and
Bertsekas (1993); Iusem and Teboulle (1993). The proximal algorithm is closely related to
the majorize-minimize (MM) algorithm (Hunter and Lange, 2004) and the convex-concave
procedure (Yuille, 2002).

For our purpose, we take $D(\cdot \| \cdot)$ to be a KL divergence between distributions on the
max nodes,

$$D(\tau \| \tau^t) = \mathrm{KL}(\tau_B(x_B) \| \tau_B^t(x_B)) = \sum_{x_B} \tau_B(x_B) \log \frac{\tau_B(x_B)}{\tau_B^t(x_B)}.$$

In this case, the proximal point algorithm reduces to Algorithm 4, which iteratively solves
a smoothed free energy objective, with natural parameter $\theta^t$ updated at each iteration.
Algorithm 4 Proximal Point Algorithm for Marginal MAP (Exact)

Initialize local marginals $\tau^0$.
for iteration t do

$$\theta^{t+1} = \theta + \lambda^t \log \tau_B^t, \tag{34}$$

$$\tau^{t+1} = \arg\max_{\tau \in \mathbb{M}} \{ \langle \tau, \theta^{t+1} \rangle + H_{A|B}(\tau) + \lambda^t H_B(\tau) \}, \tag{35}$$

end for

Decoding: $x_i^* = \arg\max_{x_i} \tau_i(x_i)$ for $\forall i \in B$.



Intuitively, the proximal inner loop (35) essentially "adds back" the truncated entropy
term $H_B(\tau)$, while canceling its effect by adjusting $\theta$ in the opposite direction. Typical
choices of $\lambda^t$ include $\lambda^t = 1$ (constant) and $\lambda^t = 1/t$ (harmonic). Note that the proximal
approach is distinct from an annealing method, which would require that the annealing
coefficient vanish to zero. Interestingly, if we take $\lambda^t = 1$, then the inner maximization
problem (35) reduces to the standard log-partition function duality (4), corresponding to a
pure marginalization task. This has the interpretation of transforming the marginal MAP
problem into a sequence of standard sum-inference problems.
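
With $\lambda^t = 1$ and exact inference, each iteration multiplies the current max-node marginal by $\exp(Q(x_B; \theta))$ and renormalizes, so the marginal concentrates on the marginal MAP solution. The sketch below illustrates this on a toy problem; it assumes the values $Q(x_B; \theta)$ are already available as a table over all $x_B$ (only feasible for tiny models), and the function name and numbers are hypothetical.

```python
import math

def proximal_marginal_map(log_q, n_iters=200):
    """Exact proximal iterations with lambda = 1 on a table of Q(x_B; theta).

    Each step sets theta^{t+1} = theta + log tau_B^t and solves a pure
    sum-inference problem, which gives tau_B^{t+1} proportional to
    exp(Q(x_B)) * tau_B^t(x_B).
    """
    configs = list(log_q)
    log_tau = {xb: -math.log(len(configs)) for xb in configs}  # uniform tau^0
    for _ in range(n_iters):
        unnorm = {xb: log_q[xb] + log_tau[xb] for xb in configs}
        m = max(unnorm.values())
        log_z = m + math.log(sum(math.exp(v - m) for v in unnorm.values()))
        log_tau = {xb: v - log_z for xb, v in unnorm.items()}
    return {xb: math.exp(v) for xb, v in log_tau.items()}

# Hypothetical objective values for a single binary max node.
log_q = {(0,): 1.0, (1,): 1.5}
tau_b = proximal_marginal_map(log_q)
x_b_star = max(tau_b, key=tau_b.get)
```

In realistic models the sum-inference step would be carried out approximately, for example by belief propagation, rather than on an explicit table.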

In practice we approximate $H_{A|B}(\tau)$ and $H_B(\tau)$ by the pairwise entropy decompositions
$\hat{H}_{A|B}(\tau)$ and $\hat{H}_B(\tau)$ in (19), respectively. If $\hat{H}_B(\tau)$ is provably convex in the sense of Weiss
et al. (2007), then the resulting approximate algorithm can be interpreted as a proximal
algorithm that maximizes $\hat{F}_{mix}(\tau, \theta)$ with proximal function defined as

$$D_{pair}(\tau \| \tau^0) = \sum_{i \in B} \kappa_i \mathrm{KL}[\tau_i(x_i) \| \tau_i^0(x_i)] + \sum_{(ij) \in E_B} \kappa_{i \to j} \mathrm{KL}[\tau_{ij}(x_i | x_j) \| \tau_{ij}^0(x_i | x_j)],$$

where $\{\kappa_i, \kappa_{i \to j}\}$ are positive, and satisfy $\kappa_i + \sum_{k \in \partial i} \kappa_{k \to i} = 1$ and $\rho_{ij} = \kappa_{i \to j} + \kappa_{j \to i}$. In
this case, Algorithm 4 still inherits proximal methods' nice convergence guarantees.

An interesting special case is when both $H_{A|B}(\tau)$ and $H_B(\tau)$ are approximated by a
Bethe approximation, which is provably convex only in some special cases, such as when $G$
is tree structured. However, we find in practice that this approximation gives very accurate
solutions, even on general loopy graphs where its convergence is no longer theoretically
guaranteed.

7. Connections to EM 

A natural algorithm for solving the marginal MAP problem is to use the
expectation-maximization (EM) algorithm, by treating $x_A$ as the hidden variables and $x_B$ as the
"parameters" to be maximized. In this section, we show that the EM algorithm can be seen as
a coordinate ascent algorithm on a mean field variant of our framework.
We start by introducing a "non-convex" generalization of Theorem 2.

Corollary 10. Let $\mathbb{M}^o$ be the subset of the marginal polytope $\mathbb{M}$ corresponding to the
distributions in which $x_B$ are clamped to some deterministic values, that is,

$$\mathbb{M}^o = \{\tau \in \mathbb{M} : \exists x_B^* \in \mathcal{X}_B, \text{ such that } \tau(x_B) = 1(x_B = x_B^*)\}.$$

Then the dual optimization (11) remains exact if the marginal polytope $\mathbb{M}$ is replaced by
any $\mathbb{N}$ satisfying $\mathbb{M}^o \subseteq \mathbb{N} \subseteq \mathbb{M}$, that is,

$$\Phi_{AB} = \max_{\tau \in \mathbb{N}} \{\langle \theta, \tau \rangle + H_{A|B}(\tau)\}. \tag{36}$$

Proof. For an arbitrary marginal MAP solution $x_B^*$, the $\tau^*$ with $\tau^*(x) = p(x | x_B = x_B^*; \theta)$
is an optimum of (11) and satisfies $\tau^* \in \mathbb{M}^o$. Therefore, restricting the optimization on $\mathbb{M}^o$
(or any $\mathbb{N}$) does not change the maximum value of the objective function. □

Remark. Among all $\mathbb{N}$ satisfying $\mathbb{M}^o \subseteq \mathbb{N} \subseteq \mathbb{M}$, the marginal polytope $\mathbb{M}$ is the smallest
(and the unique) convex set that includes $\mathbb{M}^o$, i.e., it is the convex hull of $\mathbb{M}^o$.

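To connect to EM, we define $\mathbb{M}^{\times}$, the set of distributions in which $x_A$ and $x_B$ are
independent, that is, $\mathbb{M}^{\times} = \{\tau \in \mathbb{M} \mid \tau(x) = \tau(x_A)\tau(x_B)\}$. Since $\mathbb{M}^o \subset \mathbb{M}^{\times} \subset \mathbb{M}$, the dual
optimization (11) remains exact when restricted to $\mathbb{M}^{\times}$, that is,

$$\Phi_{AB}(\theta) = \max_{\tau \in \mathbb{M}^{\times}} \{\langle \theta, \tau \rangle + H_{A|B}(\tau)\} = \max_{\tau \in \mathbb{M}^{\times}} \{\langle \theta, \tau \rangle + H_A(\tau)\}, \tag{37}$$

where the second equality holds because $H_{A|B}(\tau) = H_A(\tau)$ for $\tau \in \mathbb{M}^{\times}$.

Although $\mathbb{M}^{\times}$ is no longer a convex set, it is natural to consider a coordinate update
that alternately optimizes $\tau(x_A)$ and $\tau(x_B)$,

$$\text{Updating sum part:} \quad \tau_A^{t+1} \leftarrow \arg\max_{\tau_A \in \mathbb{M}_A} \langle \mathbb{E}_{\tau_B^t}(\theta), \tau_A \rangle + H_A(\tau_A),$$

$$\text{Updating max part:} \quad \tau_B^{t+1} \leftarrow \arg\max_{\tau_B \in \mathbb{M}_B} \langle \mathbb{E}_{\tau_A^{t+1}}(\theta), \tau_B \rangle, \tag{38}$$

where $\mathbb{M}_A$ and $\mathbb{M}_B$ are the marginal polytopes over $x_A$ and $x_B$, respectively. Note that
the sum and max steps happen to be the duals of a sum-inference and a max-inference
problem, respectively. If we go back to the primal, and update the primal configuration $x_B$
instead of $\tau_B$, (38) can be rewritten into

$$\text{E step:} \quad \tau_A^{t+1}(x_A) \leftarrow p(x_A | x_B^t; \theta),$$

$$\text{M step:} \quad x_B^{t+1} \leftarrow \arg\max_{x_B} \mathbb{E}_{\tau_A^{t+1}}(\theta),$$

which is exactly the EM update, viewing $x_B$ as parameters and $x_A$ as hidden variables.
Similar connections between EM and the coordinate ascent method on variational objectives
have been discussed in Neal and Hinton (1998) and Wainwright and Jordan (2008).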
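
The E and M steps above can be written out directly for a model small enough to enumerate. The following sketch is illustrative only: it assumes a single hidden variable $x_A$ and a single "parameter" variable $x_B$, and the function names and toy potential are hypothetical.

```python
import math

def em_marginal_map(theta, a_states, b_states, xb_init, n_iters=20):
    """EM for marginal MAP by enumeration; theta(xa, xb) is the log-potential."""
    xb = xb_init
    for _ in range(n_iters):
        # E step: tau_A(x_A) = p(x_A | x_B = xb; theta)
        logits = [theta(xa, xb) for xa in a_states]
        m = max(logits)
        weights = [math.exp(v - m) for v in logits]
        z = sum(weights)
        tau_a = [w / z for w in weights]

        # M step: maximize the expected log-potential over x_B
        def expected_theta(cand):
            return sum(t * theta(xa, cand) for t, xa in zip(tau_a, a_states))

        xb_new = max(b_states, key=expected_theta)
        if xb_new == xb:
            break
        xb = xb_new
    return xb

def q_value(theta, a_states, xb):
    """Q(x_B; theta) = log sum_{x_A} exp(theta(x_A, x_B))."""
    return math.log(sum(math.exp(theta(xa, xb)) for xa in a_states))

# Hypothetical toy potential on two binary variables.
def toy_theta(xa, xb):
    return 1.0 * (xa == xb) + 0.5 * xb

xb_em = em_marginal_map(toy_theta, a_states=[0, 1], b_states=[0, 1], xb_init=0)
```

Each iteration cannot decrease $Q(x_B; \theta)$, but on larger models the procedure can stop at a local optimum that depends on the initialization.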

When the E-step or M-step are intractable, one can insert various approximations. In
particular, approximating $\mathbb{M}_A$ by a mean-field inner bound $\mathbb{M}_A^{mf}$ leads to variational EM. An
interesting observation is obtained by using a Bethe approximation (5) to solve the E-step
and a linear relaxation to solve the M-step; in this case, the EM-like update is equivalent
to solving

$$\max_{\tau \in \mathbb{L}(G)^{\times}} \Big\{ \langle \theta, \tau \rangle + \sum_{i \in A} H_i - \sum_{(ij) \in E_A} I_{ij} \Big\}, \tag{39}$$

where $\mathbb{L}(G)^{\times}$ is the subset of $\mathbb{L}(G)$ in which $\tau_{ij}(x_i, x_j) = \tau_i(x_i)\tau_j(x_j)$ for $(ij) \in \partial_{AB}$.
Equivalently, $\mathbb{L}(G)^{\times}$ is the subset of $\mathbb{L}(G)$ in which $I_{ij} = 0$ for $(ij) \in \partial_{AB}$. Therefore, (39)
can be treated as a special case of (18) by taking $\rho_{ij} \to +\infty$, forcing the solution $\tau^*$ to
fall into $\mathbb{L}(G)^{\times}$. As we discussed in Section 4.3, EM represents an extreme of the tradeoff
between convexity and integrality implied by Theorem 5, which strongly encourages vertex
solutions by sacrificing convexity, and hence is likely to become stuck in local optima.
8. Junction Graph Belief Propagation for Marginal MAP

In the above, we have restricted the discussion to pairwise models and pairwise entropy 
approximations, mainly for the purpose of clarity. In this section, we extend our algorithms 
to leverage higher order cliques, based on the junction graph representation (Mateescu 
et al., 2010; Koller and Friedman, 2009). Other higher order methods, like generalized BP 
(Yedidia et al., 2005) or their convex variants (Wainwright et al., 2005a; Wiegerinck, 2005), 
can be derived similarly. 

A cluster graph is a graph of subsets of variables (called clusters). Formally, it is a triple $(\mathcal{G}, \mathcal{C}, \mathcal{S})$, where $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is an undirected graph, with each node $k \in \mathcal{V}$ associated with a cluster $c_k \in \mathcal{C}$, and each edge $(kl) \in \mathcal{E}$ with a subset $s_{kl} \in \mathcal{S}$ (called separators) satisfying $s_{kl} \subseteq c_k \cap c_l$. We assume that $\mathcal{C}$ subsumes the index set $\mathcal{I}$, that is, for any $\alpha \in \mathcal{I}$, we can assign it to a cluster $c_k \in \mathcal{C}$, denoted $c[\alpha]$, such that $\alpha \subseteq c_k$. In this case, we can reparameterize $\theta = \{\theta_\alpha : \alpha \in \mathcal{I}\}$ into $\theta = \{\theta_{c_k} : k \in \mathcal{V}\}$ by taking $\theta_{c_k} = \sum_{\alpha : c[\alpha] = c_k} \theta_\alpha$, without changing the distribution. Therefore, we simply assume $\mathcal{C} = \mathcal{I}$ in this paper without loss of generality. A cluster graph is called a junction graph if it satisfies the running intersection property: for each $i \in V$, the induced sub-graph consisting of the clusters and separators that include $i$ is a connected tree. A junction graph is a junction tree if $\mathcal{G}$ is a tree structured graph.
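The running intersection property defined above can be checked mechanically. The following is a minimal sketch, not part of the original algorithms: the function name `is_junction_graph` and the dictionary-based representation are illustrative assumptions, and the separators in the usage example are hypothetical (only the four cluster names are borrowed from Fig. 3).

```python
from collections import defaultdict

def is_junction_graph(clusters, separators):
    """Check the running intersection property of a cluster graph.

    clusters:   dict mapping a cluster id k -> set of variables c_k
    separators: dict mapping an edge (k, l) -> set of variables s_kl
    (Both containers are illustrative choices, not the paper's notation.)
    """
    # Every separator must be contained in both clusters it connects.
    for (k, l), sep in separators.items():
        if not sep <= (clusters[k] & clusters[l]):
            return False
    variables = set().union(*clusters.values())
    for v in variables:
        nodes = [k for k, c in clusters.items() if v in c]
        edges = [e for e, s in separators.items() if v in s]
        # A connected graph is a tree iff it has exactly |nodes| - 1 edges.
        if len(edges) != len(nodes) - 1:
            return False
        adj = defaultdict(list)
        for k, l in edges:
            adj[k].append(l)
            adj[l].append(k)
        # Depth-first search to confirm the induced sub-graph is connected.
        seen, stack = {nodes[0]}, [nodes[0]]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(seen) != len(nodes):
            return False
    return True

# Hypothetical example: clusters named as in Fig. 3, separators assumed.
clusters = {1: {"a", "b", "c"}, 2: {"b", "c", "e"},
            3: {"b", "d", "e"}, 4: {"b", "e", "f"}}
separators = {(1, 2): {"b", "c"}, (2, 3): {"b", "e"}, (2, 4): {"b", "e"}}
print(is_junction_graph(clusters, separators))
```

Adding a further edge between clusters 3 and 4 with separator {b, e} would create a cycle in the sub-graph induced by b, so the check would then fail.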

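To approximate the variational dual form, we first replace $\mathbb{M}$ with a higher order locally consistent polytope $\mathbb{L}(\mathcal{G})$, which is the set of local marginals that are consistent on the intersections of the clusters and separators, that is,

$$\mathbb{L}(\mathcal{G}) = \Big\{ \tau : \sum_{x_{c_k \setminus s_{kl}}} \tau_{c_k}(x_{c_k}) = \tau_{s_{kl}}(x_{s_{kl}}), \ \tau_{c_k}(x_{c_k}) \ge 0, \ \text{for } \forall k \in \mathcal{V}, (kl) \in \mathcal{E} \Big\}.$$

Clearly, we have $\mathbb{M} \subseteq \mathbb{L}(\mathcal{G})$ and that $\mathbb{L}(\mathcal{G})$ is tighter than the pairwise polytope $\mathbb{L}(G)$ we used previously.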

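We then approximate the joint entropy term by a linear combination of the entropies over the clusters and separators,

$$H(\tau) \approx \sum_{k \in \mathcal{V}} H_{c_k}(\tau) - \sum_{(kl) \in \mathcal{E}} H_{s_{kl}}(\tau),$$

where $H_{c_k}(\tau)$ and $H_{s_{kl}}(\tau)$ are the entropies of the local marginals $\tau_{c_k}$ and $\tau_{s_{kl}}$, respectively.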
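As a sanity check on the cluster-minus-separator entropy decomposition above, the sketch below verifies numerically that it is exact when the junction graph is a junction tree. The toy chain a - b - c, with clusters {a, b} and {b, c} and separator {b}, and all helper names are illustrative assumptions rather than anything used in the paper.

```python
import itertools
import math
import random

def entropy(p):
    """Shannon entropy (in nats) of a table {assignment: probability}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marginal(p, keep):
    """Marginalize a joint table onto the variable positions in `keep`."""
    out = {}
    for x, v in p.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + v
    return out

random.seed(0)
# Random positive pairwise factors on the chain a - b - c (binary variables).
f_ab = {(a, b): random.random() + 0.1 for a in (0, 1) for b in (0, 1)}
f_bc = {(b, c): random.random() + 0.1 for b in (0, 1) for c in (0, 1)}
joint = {(a, b, c): f_ab[a, b] * f_bc[b, c]
         for a, b, c in itertools.product((0, 1), repeat=3)}
z = sum(joint.values())
joint = {x: v / z for x, v in joint.items()}

# Exact joint entropy versus H_{ab} + H_{bc} - H_{b}.
h_exact = entropy(joint)
h_decomp = (entropy(marginal(joint, (0, 1)))
            + entropy(marginal(joint, (1, 2)))
            - entropy(marginal(joint, (1,))))
print(abs(h_exact - h_decomp) < 1e-9)
```

On a loopy junction graph the same expression is only an approximation, which is why the text above speaks of approximating the joint entropy.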
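Further, we approximate $H_B(\tau)$ by a slightly more restrictive entropy decomposition,

$$H_B(\tau) \approx \sum_{k \in \mathcal{V}} H_{\pi_k}(\tau),$$

where $\{\pi_k : k \in \mathcal{V}\}$ is a non-overlapping partition of the max nodes $B$ satisfying $\pi_k \subseteq c_k$ for $\forall k \in \mathcal{V}$. In other words, $\pi$ represents an assignment of each max node $x_b \in B$ into a cluster $k$ with $x_b \in \pi_k$. Let $\mathcal{B}$ be the set of clusters $k \in \mathcal{V}$ for which $\pi_k \neq \emptyset$, and call $\mathcal{B}$ the max-clusters; correspondingly, call $\mathcal{A} = \mathcal{V} \setminus \mathcal{B}$ the sum-clusters. See Fig. 3 for an example.
Overall, the marginal MAP dual form in (11) is approximated by

$$\max_{\tau \in \mathbb{L}(\mathcal{G})} \Big\{ \langle \theta, \tau \rangle + \sum_{k \in \mathcal{A}} H_{c_k}(\tau) + \sum_{k \in \mathcal{B}} H_{c_k | \pi_k}(\tau) - \sum_{(kl) \in \mathcal{E}} H_{s_{kl}}(\tau) \Big\} \tag{40}$$
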

Figure 3: (a) An example of a marginal MAP problem, where d, c, e are sum nodes (shaded) and a, b, f are max nodes. (b) A junction graph of (a). Selecting a partitioning of max nodes, $\pi_{bde} = \pi_{bce} = \emptyset$, $\pi_{abc} = \{a, b\}$, and $\pi_{bef} = \{f\}$, results in {bde}, {bce} being sum clusters (shaded) and {abc}, {bef} being max clusters.



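Algorithm 5 Mixed-product Junction Graph BP

1. Passing messages between clusters on the junction graph:

   $\mathcal{A} \to \mathcal{A} \cup \mathcal{B}$: $\quad m_{k \to l} \propto \sum_{x_{c_k \setminus s_{kl}}} \psi_{c_k} m_{\sim k \setminus l}$ (Sum-product message)

   $\mathcal{B} \to \mathcal{A} \cup \mathcal{B}$: $\quad m_{k \to l} \propto \sum_{x_{c_k \setminus s_{kl}}} (\psi_{c_k} m_{\sim k \setminus l}) \cdot \mathbf{1}[x_{\pi_k} \in \mathcal{X}^*_{\pi_k}]$ (Argmax-product message)

   where $\mathcal{X}^*_{\pi_k} = \arg\max_{x_{\pi_k}} \sum_{x_{c_k \setminus \pi_k}} b_k(x_{c_k})$,

   $b_k(x_{c_k}) = \psi_{c_k} \prod_{k' \in \mathcal{N}(k)} m_{k' \to k}$ and $m_{\sim k \setminus l} = \prod_{k' \in \mathcal{N}(k) \setminus \{l\}} m_{k' \to k}$.

2. Decoding: $x^*_{\pi_k} = \arg\max_{x_{\pi_k}} \sum_{x_{c_k \setminus \pi_k}} b_k(x_{c_k})$ for $\forall k \in \mathcal{B}$.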



In (40), $H_{c_k | \pi_k}(\tau) = H_{c_k}(\tau) - H_{\pi_k}(\tau)$. Optimizing (40) using a method similar to the derivation of mixed-product BP in Algorithm 2, we obtain a "mixed-product" junction graph belief propagation, given in Algorithm 5.
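To make the two message types of Algorithm 5 concrete, the following sketch computes a sum-product message and an argmax-product message for a single cluster stored as a plain table. It is only an illustration under stated assumptions: the function names, the dictionary layout (assignment tuple to value), and the tie tolerance `tol` are hypothetical, and a real implementation would also schedule messages over the whole junction graph.

```python
def marginalize(table, variables, keep):
    """Sum a table {assignment tuple: value} over all variables not in `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for x, val in table.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return out

def normalize(table):
    z = sum(table.values())
    return {k: v / z for k, v in table.items()}

def sum_product_message(psi_m, variables, separator):
    """Message from a sum cluster: sum psi * m_{~k\\l} over c_k \\ s_kl."""
    return normalize(marginalize(psi_m, variables, separator))

def argmax_product_message(psi_m, belief, variables, separator, pi, tol=1e-12):
    """Message from a max cluster whose max nodes are `pi` (a subset of c_k)."""
    # X*_pi: maximizers of the belief after summing out c_k \\ pi.
    b_pi = marginalize(belief, variables, pi)
    best = max(b_pi.values())
    x_star = {x for x, v in b_pi.items() if v >= best - tol}
    idx = [variables.index(v) for v in pi]
    # Zero out entries whose max-node assignment is not in X*_pi, then sum.
    masked = {x: (val if tuple(x[i] for i in idx) in x_star else 0.0)
              for x, val in psi_m.items()}
    return normalize(marginalize(masked, variables, separator))

# Toy cluster over (a, b): a is the max node, b is the separator variable.
variables = ["a", "b"]
psi_m = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 1.0}
print(sum_product_message(psi_m, variables, ["b"]))
print(argmax_product_message(psi_m, psi_m, variables, ["b"], ["a"]))
```

In the toy call the belief is taken equal to `psi_m`, which corresponds to a uniform incoming message from the receiving cluster; that choice is made only to keep the example short.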

Similarly to our mixed-product BP in Algorithm 2, Algorithm 5 also admits an intuitive reparameterization interpretation and a strong local optimality guarantee. Algorithm 5 can be seen as a special case of a more general junction graph BP algorithm derived in Liu and Ihler (2012) for solving maximum expected utility tasks in decision networks. For more details, we refer the reader to that work.

9. Experiments



We illustrate our algorithms on both simulated models and more realistic diagnostic Bayesian networks taken from the UAI08 inference challenge. We show that our Bethe approximation algorithms perform best among all the tested algorithms, including Jiang et al. (2011)'s hybrid message passing and a state-of-the-art local search algorithm (Park and Darwiche, 2004).



We implement our mixed-product BP in Algorithm 2 with Bethe weights (mix-product (Bethe)), the regular sum-product BP (sum-product), max-product BP (max-product) and Jiang et al. (2011)'s hybrid message passing (with Bethe weights) in Algorithm 3 (Jiang's method), where the solutions are all extracted by maximizing the singleton marginals of the max nodes. For all these algorithms, we run a maximum of 50 iterations; in case they fail to converge, we run 100 additional iterations with a damping coefficient of 0.1. We initialize all these algorithms with 5 random initializations and pick the best solution; for mix-product (Bethe) and Jiang's method, we run an additional trial initialized using the sum-product messages, which was reported to perform well in Park and Darwiche (2004) and Jiang et al. (2011). We also run the proximal point version of mixed-product BP with Bethe weights (Proximal (Bethe)), which is Algorithm 4 with both $H_{A|B}(\tau)$ and $H_B(\tau)$ approximated by Bethe approximations.
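The iteration schedule described above (up to 50 plain iterations, followed by up to 100 damped iterations if convergence fails) can be sketched generically for any fixed-point update. The function `run_with_damping`, the convergence tolerance, and in particular the damping convention (moving a fraction 0.1 of the way toward the new value) are assumptions made for illustration; the text does not specify which convention was used.

```python
def run_with_damping(update, m0, max_iters=50, extra_iters=100,
                     damping=0.1, tol=1e-6):
    """Iterate m <- update(m); fall back to damped updates if needed.

    `update` maps a list of floats (e.g. flattened messages) to a new list.
    Returns the final iterate and a flag telling whether it converged.
    """
    m = list(m0)
    for _ in range(max_iters):
        new = update(m)
        if max(abs(a - b) for a, b in zip(new, m)) < tol:
            return new, True
        m = new
    for _ in range(extra_iters):
        new = update(m)
        if max(abs(a - b) for a, b in zip(new, m)) < tol:
            return new, True
        # Assumed convention: keep (1 - damping) of the old value.
        m = [(1 - damping) * a + damping * b for a, b in zip(m, new)]
    return m, False

# An oscillating update that never converges without damping.
result, converged = run_with_damping(lambda m: [1.0 - x for x in m], [0.2])
print(converged, result)
```

The toy update flips between two values indefinitely, so only the damped phase reaches its fixed point; this mirrors the motivation for damping message updates that oscillate.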

We also implement the TRW approximation, but only using the convergent proximal point algorithm, because the TRW upper bounds are valid only when the algorithms converge. The TRW weights of $H_{A|B}$ are constructed by first (randomly) selecting spanning trees of $G_A$, and then augmenting each spanning tree with one uniformly selected edge in $\partial_{AB}$; the TRW weights of $H_B(\tau)$ are constructed to be provably convex, using the method of TRW-S in Kolmogorov (2006). We run all the proximal point algorithms for a maximum of 100 iterations, with a maximum of 5 iterations of weighted message passing updates (22)-(23) for the inner loops (with 5 additional damped iterations using a damping coefficient of 0.1).

In addition, we compare our algorithms with Samlam, which is a state-of-the-art implementation of the local search algorithm for marginal MAP (Park and Darwiche, 2004); we use its default Taboo search method with a maximum of 500 searching steps, and report the best results among 5 trials with random initializations, and one additional trial initialized by its default method (which sequentially initializes $x_i$ by maximizing $p(x_i \mid x_{\mathrm{pa}_i})$ along some predefined order).

We also implement an EM algorithm, whose expectation and maximization steps are 
approximated by sum-product and max-product BP, respectively. We run EM with 5 
random initializations and one initialization by sum-product marginals, and pick the best 
solution. 



Simulated Models. We consider pairwise models over discrete random variables taking values in $\{-1, 0, +1\}^n$,

$$p(x) \propto \exp\Big[ \sum_{i \in V} \theta_i(x_i) + \sum_{(ij) \in E} \theta_{ij}(x_i, x_j) \Big].$$

The value tables of $\theta_i$ and $\theta_{ij}$ are randomly generated from normal distributions, $\theta_i(k) \sim \mathrm{Normal}(0, 0.01)$, $\theta_{ij}(k, l) \sim \mathrm{Normal}(0, \sigma^2)$, where $\sigma$ controls the strength of coupling. Our results are averaged on 1000 randomly generated sets of parameters.

We consider different choices of graph structures and max / sum node patterns:

1. Hidden Markov chain with 20 nodes, as shown in Fig. 1. 

2. Latent tree models. We generate random trees of size 50, by finding the minimum spanning trees of random symmetric matrices with elements drawn from Uniform([0, 1]). We take the leaf nodes to be max nodes, and the non-leaf nodes to be sum nodes. See Fig. 5(a) for a typical example.

3. 10 × 10 grid with max and sum nodes distributed in two opposite chessboard patterns shown in Fig. 6(a) and Fig. 7(a), respectively. In Fig. 6(a), the sum part is a loopy graph, and the max part is a (fully disconnected) tree; in Fig. 7(a), the max and sum parts are flipped.
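The latent tree construction in item 2, together with the random value tables of the simulated models, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, Prim's algorithm is one of several ways to obtain a minimum spanning tree, the coupling strength 0.5 is an arbitrary example value, and Normal(0, 0.01) is read here as a variance of 0.01 (standard deviation 0.1), which is an assumption.

```python
import random

def random_tree(n, rng):
    """Minimum spanning tree of a random symmetric matrix (Prim's algorithm)."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = rng.random()  # uniform edge weight in [0, 1)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        i, j = min(((i, j) for i in in_tree for j in range(n)
                    if j not in in_tree),
                   key=lambda e: w[e[0]][e[1]])
        in_tree.add(j)
        edges.append((i, j))
    return edges

def random_potentials(n, edges, sigma, rng, k=3):
    """Draw theta_i(k) ~ Normal(0, 0.01) and theta_ij(k, l) ~ Normal(0, sigma^2)."""
    # Assumption: 0.01 is a variance, so the standard deviation is 0.1.
    theta_i = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
    theta_ij = {e: [[rng.gauss(0.0, sigma) for _ in range(k)]
                    for _ in range(k)] for e in edges}
    return theta_i, theta_ij

rng = random.Random(0)
edges = random_tree(50, rng)
degree = [0] * 50
for i, j in edges:
    degree[i] += 1
    degree[j] += 1
max_nodes = [i for i in range(50) if degree[i] == 1]  # leaves
sum_nodes = [i for i in range(50) if degree[i] > 1]   # non-leaf nodes
theta_i, theta_ij = random_potentials(50, edges, sigma=0.5, rng=rng)
print(len(edges), len(max_nodes), len(sum_nodes))
```

Each value table has three entries per variable, matching the state space {-1, 0, +1} used in the simulated models.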

The results on the hidden Markov chain are shown in Fig. 4, where we plot in panel (a) different algorithms' percentages of obtaining the globally optimal solutions among 1000 random trials, and in panel (b) their relative energy errors defined by $Q(\hat{x}_B; \theta) - Q(x^*_B; \theta)$, where $\hat{x}_B$ is the solution returned by the algorithms, and $x^*_B$ is the true optimum.

The results of the latent tree models and the two types of 2D grids are shown in Fig. 5, Fig. 6 and Fig. 7, respectively. Since the globally optimal solution $x^*_B$ is not tractable to calculate in these cases, we report the approximate relative error defined by $Q(\hat{x}_B; \theta) - Q(\tilde{x}_B; \theta)$, where $\tilde{x}_B$ is the best solution we found across all algorithms.
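For very small models the relative energy error can be computed exactly by enumeration, which may help when checking an implementation. The sketch below assumes that Q is the log of the unnormalized marginal over the sum variables, that is, the log of the sum over the sum nodes of exp(theta(x)); the function names, the callable `theta`, and the two-variable toy model are illustrative assumptions.

```python
import itertools
import math

def q_value(theta, x_b, sum_vars, max_vars, states):
    """Q(x_B; theta) = log sum_{x_A} exp(theta([x_A, x_B])) by enumeration."""
    total = 0.0
    for x_a in itertools.product(states, repeat=len(sum_vars)):
        x = dict(zip(sum_vars, x_a))
        x.update(zip(max_vars, x_b))
        total += math.exp(theta(x))
    return math.log(total)

def relative_error(theta, x_hat, sum_vars, max_vars, states):
    """Q(x_hat) - Q(x_star), with x_star found by exhaustive search."""
    best = max(q_value(theta, x_b, sum_vars, max_vars, states)
               for x_b in itertools.product(states, repeat=len(max_vars)))
    return q_value(theta, x_hat, sum_vars, max_vars, states) - best

# Toy model: one sum node "a", one max node "b", states {-1, 0, +1}.
def theta(x):
    return 0.5 * x["a"] * x["b"] + 0.2 * x["b"]

states = (-1, 0, 1)
for x_b in states:
    print(x_b, relative_error(theta, (x_b,), ["a"], ["b"], states))
```

The error is zero for an optimal configuration and negative otherwise. The enumeration is exponential in the number of variables, so it is only a debugging aid for tiny instances.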

Diagnostic Bayesian Networks. We also test our algorithms on two diagnostic 
Bayesian networks taken from the UAI08 Inference Challenge, where we construct marginal 
MAP problems by randomly selecting varying percentages of nodes to be max nodes. Since 
these models are not pairwise, we implement the junction graph versions of mix-product 
(Bethe) and proximal (Bethe) shown in Section 8. Fig. 8 shows the approximate relative 
errors of our algorithms and local search (Samlam) as the percentage of the max nodes 
varies. 

Insights. Across all the experiments, we find that mix-product (Bethe), proximal (Bethe) and local search (Samlam) significantly outperform all the other algorithms, while proximal (Bethe) outperforms the two others in some circumstances. In the hidden Markov chain example in Fig. 4, these three algorithms almost always (with probability > 99%) find the globally optimal solutions. However, the performance of Samlam tends to degrade when the max part has loopy dependency structures (see Fig. 7), or when the number of max nodes is large (see Fig. 8), both of which make it difficult to explore the solution space by local search. On the other hand, mix-product (Bethe) tends to degrade as the coupling strength $\sigma$ increases (see Fig. 7), probably because its convergence gets worse as $\sigma$ increases.

We note that our TRW approximation gives much less accurate solutions than the other 
algorithms, but is able to provide an upper bound on the optimal energy. Similar phenomena 
have been observed for TRW-BP in standard max- and sum- inference. 

The hybrid message passing of Jiang et al. (2011) is significantly worse than mix-product (Bethe), proximal (Bethe) and local search (Samlam), but is otherwise the best among the remaining algorithms. EM performs similarly to (or sometimes worse than) Jiang's method.

Figure 4: Results on the hidden Markov chain in Fig. 1 (best viewed in color). (a) Different algorithms' probabilities of obtaining the globally optimal solution among 1000 random trials. Mix-product (Bethe), Proximal (Bethe) and Local Search (Samlam) almost always (with probability > 99%) find the optimal solution. (b) The relative energy errors of the different algorithms, and the upper bounds obtained by Proximal (TRW).




Figure 5: (a) A typical latent tree model, whose leaf nodes are taken to be max nodes (white) and non-leaf nodes to be sum nodes (shaded). (b) The approximate relative energy errors of different algorithms, and the upper bound obtained by Proximal (TRW).



The regular max-product BP and sum-product BP are among the worst of the tested algorithms, indicating the danger of approximating mixed-inference by pure max- or sum-inference. Interestingly, the performances of max-product BP and sum-product BP have opposite trends: In Fig. 4, Fig. 5 and Fig. 6, where the max parts are fully disconnected and the sum parts are connected and loopy, max-product BP usually performs worse than sum-product BP, but gets better as the coupling strength $\sigma$ increases; sum-product BP, on the other hand, tends to degrade as $\sigma$ increases. In Fig. 7, where the max / sum pattern is reversed (resulting in a larger, loopier max subgraph), max-product BP performs better than sum-product BP.

Figure 6: (a) A marginal MAP problem defined on a 10 × 10 Ising grid, with shaded sum nodes and unshaded max nodes; note that the sum part is a loopy graph, while the max part is fully disconnected. (b) The approximate relative errors of different algorithms as a function of coupling strength $\sigma$.




Figure 7: (a) A marginal MAP problem defined on a 10 × 10 Ising grid, but with max / sum part exactly opposite to that in Fig. 6; note that the max part is loopy, while the sum part is fully disconnected in this case. (b) The approximate relative errors of different algorithms as a function of coupling strength.



10. Conclusion and Further Directions 

We have presented a general variational framework for solving marginal MAP problems 
approximately, opening new doors for developing efficient algorithms. In particular, we 
show that our proposed "mixed-product" BP admits appealing theoretical properties and 
performs well in practice. 

Potential future directions include improving the performance of the truncated TRW approximation by optimizing weights, deriving optimality conditions that may be applicable even when the sum component does not form a tree, studying the convergence properties of mixed-product BP, and leveraging our results to learn hidden variable models for data.

Figure 8: The results on two diagnostic Bayesian networks (BNs) in the UAI08 inference challenge. (a) The structure of Diagnostic BN-2, with 50% randomly selected sum nodes shaded. (b)-(c) The performances of algorithms on Diagnostic BN-1 and Diagnostic BN-2, respectively, as a function of the percentage of max nodes. Results are averaged over 100 random trials.

Appendix A. Proof of Theorem 5

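Proof. (i). For $\tau \in \mathbb{M}^o$, the objective function in (18) equals

$$F_{tree}(\tau, \theta) = \langle \theta, \tau \rangle + \sum_{i \in V} H_i - \sum_{(ij) \in E_A} I_{ij} - \sum_{(ij) \in \partial_{AB}} \rho_{ij} I_{ij}$$

$$= \langle \theta, \tau \rangle + \sum_{i \in V} H_i - \sum_{(ij) \in E_A} I_{ij} \tag{41}$$

$$= \langle \theta, \tau \rangle + H_{A|B}(\tau), \tag{42}$$

where the equality in (41) is because $I_{ij} = 0$ for $\forall (ij) \in \partial_{AB}$, and the equality in (42) is because the sum part $G_A$ is a tree and we have the tree decomposition $H_{A|B} = \sum_{i \in V} H_i - \sum_{(ij) \in E_A} I_{ij}$. Therefore we have

$$\Phi_{tree}(\theta) = \max_{\tau \in \mathbb{L}(G)} F_{tree}(\tau, \theta) \ge \max_{\tau \in \mathbb{M}^o} F_{tree}(\tau, \theta) = \max_{\tau \in \mathbb{M}^o} F_{mix}(\tau, \theta) = \Phi_{AB}(\theta), \tag{43}$$

where the inequality is because $\mathbb{M}^o \subset \mathbb{M} \subset \mathbb{L}(G)$.

If there exists $x^*_B$ such that $Q(x^*_B; \theta) = \Phi_{tree}(\theta)$, then we have

$$Q(x^*_B; \theta) = \Phi_{tree}(\theta) \ge \Phi_{AB}(\theta) = \max_{x_B} Q(x_B; \theta).$$

This proves that $x^*_B$ is a globally optimal marginal MAP solution.

(ii). Because $\tau^*_i(x_i)$ for $\forall i \in B$ are deterministic, and the sum part $G_A$ is a tree, we have that $\tau^* \in \mathbb{M}^o$. Therefore the inequality in (43) is tight, and we can conclude the proof by using Corollary 10. □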

Appendix B. Proof of Theorem 9 

Proof. By Theorem 8, the beliefs $\{b_i, b_{ij}\}$ should satisfy the reparameterization property in (30) and the consistency conditions in (31)-(33). Without loss of generality, we assume $\{b_i, b_{ij}\}$ are normalized such that $\sum_{x_i} b_i(x_i) = 1$ for $i \in A$ and $\max_{x_i} b_i(x_i) = 1$ for $i \in B$.

I) For simplicity, we first prove the case of $C = B$, when $G = G_{C \cup A}$ itself is a semi A-B tree, and the theorem implies that $x^*_B$ is a global optimum. By the reparameterization condition, we have

$$p(x) = p_B(x_B)\, p_{A|B}(x), \tag{44}$$

where

$$p_B(x_B) = \prod_{i \in B} b_i(x_i) \prod_{(ij) \in E_B} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}}, \tag{45}$$

$$p_{A|B}(x) = \prod_{i \in A} b_i(x_i) \prod_{(ij) \in E_A} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}} \prod_{(ij) \in \partial_{AB}} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}}. \tag{46}$$

Note we have

$$p(x_B) = \sum_{x_A} p(x) = \sum_{x_A} p_B(x_B)\, p_{A|B}(x) = p_B(x_B) \sum_{x_A} p_{A|B}(x).$$

We just need to show that $x^*_B$ maximizes $p_B(x_B)$ and $\sum_{x_A} p_{A|B}(x)$, respectively.

First, since $p_B(x_B)$ involves only the max nodes, a standard MAP analysis applies. Because the max part of the beliefs, $\{b_i, b_{ij} : (ij) \in E_B\}$, satisfy the standard max-consistency conditions, and the corresponding TRW weights $\{\rho_{ij} : (ij) \in E_B\}$ are provably convex by assumption, we establish that $x^*_B$ is the MAP solution of $p_B(x_B)$ by Theorem 1 of Weiss et al. (2007).

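Secondly, to show that $x^*_B$ also maximizes $\sum_{x_A} p_{A|B}(x)$ requires the combination of the mixed-consistency and sum-consistency conditions. Since $G$ is a semi A-B tree, we denote by $\pi_i$ the unique parent node of $i$ ($\pi_i = \emptyset$ if $i$ is a root). In addition, let $\partial_A$ be the subset of $A$ whose parent nodes are in $B$, that is, $\partial_A = \{i \in A : \pi_i \in B\}$. Equation (46) can be reformed into

$$p_{A|B}(x) = \prod_{i \in A \setminus \partial_A} \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})} \prod_{i \in \partial_A} \Big[ \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})} \Big]^{\rho_{i,\pi_i}} \big[ b_i(x_i) \big]^{1 - \rho_{i,\pi_i}}, \tag{47}$$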



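where we used the fact that $\rho_{ij} = 1$ for $(ij) \in E_A$. Therefore, we have for any $x_B \in \mathcal{X}_B$,

$$\sum_{x_A} p_{A|B}(x) = \sum_{x_A} \prod_{i \in A \setminus \partial_A} \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})} \prod_{i \in \partial_A} \Big[ \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})} \Big]^{\rho_{i,\pi_i}} \big[ b_i(x_i) \big]^{1 - \rho_{i,\pi_i}}$$

$$= \prod_{i \in \partial_A} \sum_{x_i} \Big[ \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})} \Big]^{\rho_{i,\pi_i}} \big[ b_i(x_i) \big]^{1 - \rho_{i,\pi_i}} \tag{48}$$

$$\le \prod_{i \in \partial_A} \Big[ \sum_{x_i} \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})} \Big]^{\rho_{i,\pi_i}} \Big[ \sum_{x_i} b_i(x_i) \Big]^{1 - \rho_{i,\pi_i}} \tag{49}$$

$$= 1, \tag{50}$$

where the equality in (48) eliminates (by summation) all the interior nodes in $A$. The inequality in (49) follows from Hölder's inequality. Finally, the equality in (50) holds because all the sum part of beliefs $\{b_i, b_{ij} : (ij) \in E_A\}$ satisfies the sum-consistency (31).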
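For readers who want the Hölder step behind (49) spelled out, the following worked block shows the form of the inequality being applied and the choice of functions. It is a sketch under the assumption that the weights on the edges between sum and max nodes satisfy $0 < \rho_{i,\pi_i} < 1$ (the boundary cases hold trivially); the symbols $f$ and $g$ are introduced here for illustration only.

```latex
% Hölder's inequality with conjugate exponents 1/\rho and 1/(1-\rho),
% for nonnegative functions f and g and 0 < \rho < 1:
\sum_{x_i} f(x_i)^{\rho}\, g(x_i)^{1-\rho}
  \;\le\; \Big[ \sum_{x_i} f(x_i) \Big]^{\rho}
          \Big[ \sum_{x_i} g(x_i) \Big]^{1-\rho} .

% Choice used for each i \in \partial_A, with \rho = \rho_{i,\pi_i}
% and the parent value x_{\pi_i} held fixed:
f(x_i) = \frac{b_{i,\pi_i}(x_i, x_{\pi_i})}{b_{\pi_i}(x_{\pi_i})},
\qquad
g(x_i) = b_i(x_i) .

% Sum-consistency gives \sum_{x_i} f(x_i) = 1, and the normalization of
% the sum-node beliefs gives \sum_{x_i} g(x_i) = 1, so each factor of the
% product is bounded by 1^{\rho} \cdot 1^{1-\rho} = 1.
```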

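On the other hand, for any $(i, \pi_i) \in \partial_{AB}$, because $x^*_{\pi_i} = \arg\max_{x_{\pi_i}} b_{\pi_i}(x_{\pi_i})$, we have $b_{i,\pi_i}(x_i, x^*_{\pi_i}) = b_i(x_i)$ by the mixed-consistency condition (33). Therefore,

$$\sum_{x_A} p_{A|B}([x_A, x^*_B]) = \prod_{i \in \partial_A} \sum_{x_i} \Big[ \frac{b_{i,\pi_i}(x_i, x^*_{\pi_i})}{b_{\pi_i}(x^*_{\pi_i})} \Big]^{\rho_{i,\pi_i}} \big[ b_i(x_i) \big]^{1 - \rho_{i,\pi_i}} \tag{51}$$

$$= \prod_{i \in \partial_A} \sum_{x_i} b_i(x_i) \tag{52}$$

$$= 1. \tag{53}$$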



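Combining (50) and (53), we have $\sum_{x_A} p_{A|B}(x) \le \sum_{x_A} p_{A|B}([x_A, x^*_B]) = 1$ for any $x_B \in \mathcal{X}_B$, that is, $x^*_B$ maximizes $\sum_{x_A} p_{A|B}(x)$. This finishes the proof for the case $C = B$.

II) In the case of $C \neq B$, let $D = B \setminus C$. We decompose $p(x)$ into

$$p(x) = p_B([x_C, x_D])\, p_{A|C}([x_A, x_C])\, r_{AD}([x_A, x_D]),$$

where $p_B(x_B)$ and $p_{A|C}([x_A, x_C])$ are defined similarly to (45) and (46),

$$p_B(x_B) = \prod_{i \in B} b_i(x_i) \prod_{(ij) \in E_B} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}}, \tag{54}$$

$$p_{A|C}([x_A, x_C]) = \prod_{i \in A} b_i(x_i) \prod_{(ij) \in E_A} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}} \prod_{(ij) \in \partial_{AC}} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}}, \tag{55}$$

where $\pi_i$ is the parent node of $i$ in the semi A-B tree $G_{A \cup C}$ and $\partial_{AC}$ is the set of edges across $A$ and $C$, that is, $\partial_{AC} = \{(ij) \in E : i \in A, j \in C\}$. The term $r_{AD}(x)$ is defined as

$$r_{AD}([x_A, x_D]) = \prod_{(ij) \in \partial_{AD}} \Big[ \frac{b_{ij}(x_i, x_j)}{b_i(x_i) b_j(x_j)} \Big]^{\rho_{ij}}, \tag{56}$$

where similarly $\partial_{AD}$ is the set of edges across $A$ and $D$.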

Because $x^*_j = \arg\max_{x_j} b_j(x_j)$ for $j \in D$, we have $b_{ij}(x_i, x^*_j) = b_i(x_i)$ for $(ij) \in \partial_{AD}$, $j \in D$ by the mixed-consistency condition in (33). Therefore, one can show that $r_{AD}([x_A, x^*_D]) = 1$, and hence

$$p([x_A, x_C, x^*_D]) = p_B([x_C, x^*_D])\, p_{A|C}([x_A, x_C]).$$

The remainder of the proof is similar to that for the case $C = B$: by the analysis in Weiss et al. (2007), it follows that $x^*_C \in \arg\max_{x_C} p_B([x_C, x^*_D])$, and we have previously shown that $x^*_C \in \arg\max_{x_C} \sum_{x_A} p_{A|C}([x_A, x_C])$. This establishes that $x^*_C$ maximizes

$$\sum_{x_A} p([x_A, x_C, x^*_D]) = p_B([x_C, x^*_D]) \sum_{x_A} p_{A|C}([x_A, x_C]),$$

which concludes the proof. □